Here are some notes on decisions I have made at different stages in this notebook (let us know in the group chat if you want to make any changes to these steps or have any feedback/concerns):
Target Class
I chose roadSurface as the target class. There doesn't seem to be much of a difference between this category and traffic, apart from roadSurface having a slightly better balance between the classes, and its minority class ('FullOfHolesCondition') might be easier to predict than traffic's smallest class as it's more correlated with other variables. I think the main thing is that we all work on the same target class to begin with, and we can always try a different target class later on if we have extra time.
Missing Values
Feature Selection
Features were chosen based on their correlation with the roadSurface classes. I just eyeballed this by looking for variables that had very low correlation values for each class. There is probably a more technical way to do this (let me know if you have any ideas), or David, you might be able to use your subject knowledge to advise on which features we should use.
Features removed: VehicleSpeedVariation, AltitudeVariation, LongitudinalAcceleration, EngineLoad, drivingStyle and the original index column (Unnamed: 0).
After transforming the data, VerticalAcceleration still had a lot of outliers. I decided to keep it as a feature anyway, but we could try removing it, or just dropping those rows. However, these outliers don't seem to have had much of an effect on the final model I built, so maybe we can disregard them.
Update on feature selection:
I used a random forest classifier to find the importance of each feature after the training data (with all features) had gone through the preprocessing pipeline. I then removed the five least important features (VehicleSpeedVariation, AltitudeVariation, MassAirFlow, EngineLoad and drivingStyle), leaving a reduced feature set for training the models and lowering the risk of overfitting.
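For anyone who wants to repeat the importance ranking, here is a minimal sketch of the idea. It is not the exact code from the pipeline: the data is synthetic, the column names (f0 to f4) and the number of features dropped (n_drop) are placeholders, and the forest settings are just defaults with a fixed seed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the trip data): the label depends only on f0 and f1.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] + X["f1"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank features from most to least important.
importances = pd.Series(forest.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)

# Drop the n least important features (n is a free choice; the notes above used five).
n_drop = 2
to_drop = list(ranked.index[-n_drop:])
X_reduced = X.drop(columns=to_drop)

print(ranked)
print("Dropped:", to_drop)
```

Impurity-based importances like these can favour high-cardinality features, so it may be worth sanity-checking the ranking against model performance before and after dropping.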
Transforming and Scaling
Models
HyperParameters
Metrics
#import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
print(os.listdir('./data'))
['opel_corsa_01.csv', 'opel_corsa_02.csv', 'peugeot_207_01.csv', 'peugeot_207_02.csv']
df1 = pd.read_csv('./data/opel_corsa_01.csv', delimiter=';')
df1.dataframeName = 'opel_01.'
df2 = pd.read_csv('./data/opel_corsa_02.csv', delimiter=';')
df2.dataframeName = 'opel_02'
df3 = pd.read_csv('./data/peugeot_207_01.csv', delimiter=';')
df3.dataframeName = 'peugeot_01'
df4 = pd.read_csv('./data/peugeot_207_02.csv', delimiter=';')
df4.dataframeName = 'peugeot_02'
df_list = [df1, df2, df3, df4]
for df in df_list:
    name = df.dataframeName
    nRow, nCol = df.shape
    print(f'In {name}, there are {nRow} rows and {nCol} columns. \n')
In opel_01., there are 7038 rows and 18 columns.
In opel_02, there are 4092 rows and 18 columns.
In peugeot_01, there are 8199 rows and 18 columns.
In peugeot_02, there are 4446 rows and 18 columns.
We can see that the first trip for both cars is almost twice as long as the second trip (about 1.7 times as many rows for the Opel and 1.8 times as many for the Peugeot).
Let's explore the individual datasets before joining them. First, let's take a look at the structure of the datasets.
df1.head()
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | roadSurface | traffic | drivingStyle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | -2.299988 | 25.670519 | 13.223501 | 121.592690 | -2.476980 | 0.3555 | 4.705883 | 68 | 106 | 1796 | 15.81 | 24 | -0.1133 | 19.497335 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 1 | 60 | -2.099976 | 24.094259 | 13.638919 | 120.422571 | -1.576260 | 0.4492 | 10.588236 | 68 | 103 | 1689 | 14.65 | 22 | -0.1289 | 19.515722 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 2 | 61 | -1.500000 | 22.743179 | 14.031043 | 118.456769 | -1.351080 | 0.4258 | 27.450981 | 68 | 103 | 1599 | 11.85 | 21 | -0.1328 | 19.441765 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 3 | 62 | 0.100037 | 22.292820 | 14.171073 | 117.571308 | -0.450359 | 0.4140 | 24.313726 | 69 | 104 | 1620 | 12.21 | 20 | -0.0859 | 19.388769 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 4 | 63 | 0.099976 | 23.643900 | 14.328954 | 117.074149 | 1.351080 | 0.3945 | 20.000000 | 69 | 104 | 1708 | 11.91 | 21 | -0.0664 | 19.301638 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
df1.tail()
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | roadSurface | traffic | drivingStyle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7033 | 7387 | -4.900024 | 110.788551 | 114.248823 | 37.916017 | 1.801430 | 0.0273 | 17.647058 | 82 | 111 | 2216 | 18.010000 | 21 | 0.1406 | 9.631930 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 7034 | 7388 | -5.200012 | 110.788551 | 114.079938 | 37.335264 | 0.000000 | 0.0625 | 23.137255 | 82 | 112 | 2209 | 16.900000 | 20 | 0.1289 | 9.565511 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 7035 | 7389 | -5.000000 | 111.689278 | 113.914806 | 36.446619 | 0.900726 | 0.0391 | 29.803923 | 82 | 113 | 2208 | 18.760000 | 20 | 0.1016 | 9.495973 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 7036 | 7390 | -5.200012 | 111.013740 | 113.693379 | 34.711628 | -0.675537 | 0.0625 | 37.647060 | 82 | 120 | 2210 | 21.690001 | 21 | 0.0742 | 9.433368 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
| 7037 | 7391 | -5.899963 | 108.086395 | 113.423163 | 33.263671 | -2.927345 | 0.1719 | 26.666668 | 81 | 121 | 2214 | 17.600000 | 21 | 0.1406 | 9.362569 | SmoothCondition | LowCongestionCondition | EvenPaceStyle |
#looking at the data types
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7038 entries, 0 to 7037
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Unnamed: 0                 7038 non-null   int64
 1   AltitudeVariation          7038 non-null   float64
 2   VehicleSpeedInstantaneous  7038 non-null   float64
 3   VehicleSpeedAverage        7038 non-null   float64
 4   VehicleSpeedVariance       7038 non-null   float64
 5   VehicleSpeedVariation      7038 non-null   float64
 6   LongitudinalAcceleration   7038 non-null   float64
 7   EngineLoad                 7038 non-null   float64
 8   EngineCoolantTemperature   7038 non-null   int64
 9   ManifoldAbsolutePressure   7038 non-null   int64
 10  EngineRPM                  7038 non-null   int64
 11  MassAirFlow                7038 non-null   float64
 12  IntakeAirTemperature       7038 non-null   int64
 13  VerticalAcceleration       7038 non-null   float64
 14  FuelConsumptionAverage     7038 non-null   float64
 15  roadSurface                7038 non-null   object
 16  traffic                    7038 non-null   object
 17  drivingStyle               7038 non-null   object
dtypes: float64(10), int64(5), object(3)
memory usage: 989.8+ KB
#descriptive statistics
df1.describe()
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 | 7038.000000 |
| mean | 3772.072748 | -0.675845 | 36.428319 | 36.723932 | 213.004353 | -0.029563 | 0.143530 | 26.487416 | 77.924979 | 116.234157 | 1569.145354 | 16.358274 | 16.048878 | 0.055929 | 15.446272 |
| std | 2118.196795 | 1.691601 | 32.901312 | 29.366391 | 205.717663 | 2.390997 | 0.744697 | 19.462750 | 7.076616 | 20.660674 | 551.406613 | 9.488889 | 4.342145 | 0.379679 | 4.311013 |
| min | 59.000000 | -9.200012 | 0.000000 | 0.000000 | 0.000000 | -17.789218 | -2.380000 | 0.000000 | 40.000000 | 98.000000 | 752.000000 | 4.010000 | 7.000000 | -1.246000 | 7.271883 |
| 25% | 1936.250000 | -1.500000 | 8.782019 | 16.698035 | 54.333652 | -0.900722 | -0.339800 | 13.725491 | 79.000000 | 102.000000 | 936.000000 | 8.080000 | 12.000000 | -0.222700 | 12.319374 |
| 50% | 3813.500000 | -0.399963 | 29.273399 | 28.312631 | 144.864363 | 0.000000 | 0.140800 | 25.490196 | 80.000000 | 109.000000 | 1659.500000 | 15.330000 | 16.000000 | 0.070300 | 15.284765 |
| 75% | 5612.750000 | 0.100037 | 54.043198 | 47.595544 | 299.377339 | 0.900721 | 0.683600 | 34.901962 | 81.000000 | 122.000000 | 2033.000000 | 21.690001 | 19.000000 | 0.312000 | 18.393147 |
| max | 7391.000000 | 5.200012 | 124.749725 | 121.330733 | 1051.789888 | 12.384899 | 2.360000 | 100.000000 | 85.000000 | 252.000000 | 3104.000000 | 73.250000 | 34.000000 | 1.210000 | 25.666862 |
df2.describe()
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 |
| mean | 2198.175953 | -0.139590 | 43.482246 | 43.419217 | 162.980180 | -0.011556 | 1.603635 | 34.509613 | 73.378299 | 123.764907 | 1656.040811 | 18.608326 | 18.521261 | 4.061078 | 17.400087 |
| std | 1248.536624 | 2.484872 | 37.543881 | 35.323457 | 162.805408 | 2.227681 | 3.057510 | 23.559170 | 12.875391 | 30.908416 | 575.566227 | 12.334384 | 3.408300 | 7.221253 | 5.970610 |
| min | 44.000000 | -8.299988 | 0.000000 | 0.000000 | 0.000000 | -11.934539 | -1.710800 | 0.000000 | 40.000000 | 98.000000 | 760.000000 | 4.270000 | 10.000000 | -1.140000 | 7.929113 |
| 25% | 1125.750000 | -2.800003 | 11.259000 | 14.868447 | 42.682230 | -0.900719 | -0.254950 | 20.000000 | 64.000000 | 101.000000 | 1087.000000 | 8.445000 | 16.000000 | -0.007800 | 11.634232 |
| 50% | 2188.500000 | 0.000000 | 34.452538 | 26.422996 | 106.038210 | 0.000000 | 0.125000 | 30.588236 | 80.000000 | 110.000000 | 1762.500000 | 15.820000 | 18.000000 | 0.223000 | 18.079138 |
| 75% | 3285.250000 | 1.024994 | 70.312449 | 70.051620 | 240.313896 | 0.900722 | 1.060000 | 47.843140 | 83.000000 | 136.000000 | 2156.000000 | 25.330000 | 20.000000 | 0.716000 | 21.760530 |
| max | 4327.000000 | 10.700012 | 122.723091 | 114.706688 | 956.695096 | 11.259000 | 8.477800 | 100.000000 | 89.000000 | 250.000000 | 3167.000000 | 67.309998 | 30.000000 | 17.944800 | 45.336861 |
df3.describe()
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8199.000000 | 8199.000000 | 8196.000000 | 8199.000000 | 8199.000000 | 8199.000000 | 8199.000000 | 8194.000000 | 8194.000000 | 8194.000000 | 8194.000000 | 8194.000000 | 8194.000000 | 8199.000000 | 8194.000000 |
| mean | 4289.953531 | -0.167142 | 46.627707 | 46.889535 | 159.545051 | -0.020856 | 1.127696 | 45.079854 | 68.191237 | 115.252990 | 1520.705394 | 16.992544 | 33.897120 | -0.649417 | 12.986515 |
| std | 2482.291344 | 2.271266 | 35.940316 | 33.502960 | 188.542234 | 2.760644 | 0.759994 | 30.615258 | 17.535489 | 16.322914 | 611.017251 | 9.174752 | 11.639077 | 0.640123 | 3.136726 |
| min | 59.000000 | -24.600006 | 0.000000 | 0.000000 | 0.000000 | -103.500000 | -1.457600 | 0.000000 | 12.000000 | 88.000000 | 0.000000 | 0.880000 | 8.000000 | -2.763100 | 7.847495 |
| 25% | 2118.500000 | -1.300003 | 18.900000 | 19.582499 | 41.323268 | -0.900000 | 0.578950 | 23.137255 | 56.000000 | 103.000000 | 898.625000 | 7.300000 | 24.000000 | -1.052800 | 10.086075 |
| 50% | 4285.000000 | -0.099998 | 37.799999 | 35.954999 | 103.799893 | 0.000000 | 1.161900 | 40.392159 | 79.000000 | 107.000000 | 1496.500000 | 17.219999 | 36.000000 | -0.649800 | 12.868294 |
| 75% | 6393.500000 | 0.900002 | 81.000000 | 75.337498 | 202.930619 | 0.900002 | 1.646400 | 75.686279 | 79.000000 | 126.000000 | 1975.375000 | 24.740000 | 41.000000 | -0.180550 | 14.868025 |
| max | 8613.000000 | 10.299999 | 119.699997 | 114.884996 | 1418.370369 | 97.199997 | 3.979800 | 100.000000 | 86.000000 | 170.000000 | 2802.500000 | 38.549999 | 65.000000 | 0.999900 | 27.919697 |
df4.describe()
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4446.000000 | 4446.000000 | 4440.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 | 4446.000000 |
| mean | 2333.006073 | 0.100135 | 22.962508 | 22.975994 | 137.937403 | -0.014014 | -0.137473 | 39.771728 | 62.056230 | 105.556230 | 1138.302632 | 11.446912 | 20.624606 | -0.150632 | 16.710177 |
| std | 1323.814689 | 1.797515 | 18.623617 | 14.424310 | 132.251407 | 2.258688 | 0.758267 | 25.467201 | 18.477506 | 5.390143 | 389.589388 | 5.967084 | 5.206478 | 0.560713 | 4.136263 |
| min | 59.000000 | -9.199997 | 0.000000 | 0.000000 | 0.000000 | -31.072817 | -3.065000 | 0.000000 | 15.000000 | 96.000000 | 0.000000 | 0.880000 | 11.000000 | -2.510900 | 10.344559 |
| 25% | 1170.250000 | -0.599998 | 5.400000 | 11.396250 | 48.322445 | -0.900000 | -0.563500 | 27.843138 | 47.000000 | 102.000000 | 780.500000 | 5.580000 | 17.000000 | -0.453725 | 12.978312 |
| 50% | 2322.500000 | 0.000000 | 22.500000 | 21.021283 | 103.170886 | 0.000000 | -0.217000 | 36.862747 | 66.000000 | 103.000000 | 1060.500000 | 10.360000 | 20.000000 | -0.051000 | 15.995884 |
| 75% | 3467.750000 | 0.900002 | 36.899998 | 35.459999 | 174.184256 | 0.900000 | 0.307475 | 60.000000 | 79.000000 | 107.000000 | 1473.500000 | 16.629999 | 24.000000 | 0.196100 | 20.450338 |
| max | 4622.000000 | 11.400002 | 72.000000 | 59.984998 | 864.046635 | 30.599998 | 2.244800 | 100.000000 | 86.000000 | 144.000000 | 2239.000000 | 30.990000 | 41.000000 | 1.501500 | 30.672386 |
14 Numeric Attributes:
AltitudeVariation - altitude change calculated over 10 seconds;
VehicleSpeedInstantaneous - current speed value;
VehicleSpeedAverage - average speed in the last 60 seconds;
VehicleSpeedVariance - speed variance in the last 60 seconds;
VehicleSpeedVariation - speed variation for every second of detection;
LongitudinalAcceleration - measured by the smartphone accelerometer and pre-processed with a low-pass filter;
EngineLoad - expressed as a percentage;
EngineCoolantTemperature - in degrees Celsius;
ManifoldAbsolutePressure - (MAP), a parameter the internal combustion engine uses to compute the optimal air/fuel ratio;
EngineRPM - Revolutions per Minute of the engine;
MassAirFlow - (MAF) Rate measured in g/s, used by the engine to set fuel delivery and spark timing;
IntakeAirTemperature - (IAT) at the engine entrance;
VerticalAcceleration - measured by the smartphone accelerometer and pre-processed with a low-pass filter;
FuelConsumptionAverage - calculated as liters needed per 100 km.
3 Categorical Attributes:
roadSurface - 3 classes: SmoothCondition, FullOfHolesCondition, UnevenCondition;
traffic - 3 classes: LowCongestionCondition, NormalCongestionCondition, HighCongestionCondition;
drivingStyle - 2 classes: EvenPaceStyle, AggressiveStyle.
#checking for null values
for df in df_list:
    print("Number of null values in ", df.dataframeName, ":\n")
    print(df.isna().sum(), "\n")
Number of null values in opel_01. :

Unnamed: 0                   0
AltitudeVariation            0
VehicleSpeedInstantaneous    0
VehicleSpeedAverage          0
VehicleSpeedVariance         0
VehicleSpeedVariation        0
LongitudinalAcceleration     0
EngineLoad                   0
EngineCoolantTemperature     0
ManifoldAbsolutePressure     0
EngineRPM                    0
MassAirFlow                  0
IntakeAirTemperature         0
VerticalAcceleration         0
FuelConsumptionAverage       0
roadSurface                  0
traffic                      0
drivingStyle                 0
dtype: int64

Number of null values in opel_02 :

Unnamed: 0                   0
AltitudeVariation            0
VehicleSpeedInstantaneous    0
VehicleSpeedAverage          0
VehicleSpeedVariance         0
VehicleSpeedVariation        0
LongitudinalAcceleration     0
EngineLoad                   0
EngineCoolantTemperature     0
ManifoldAbsolutePressure     0
EngineRPM                    0
MassAirFlow                  0
IntakeAirTemperature         0
VerticalAcceleration         0
FuelConsumptionAverage       0
roadSurface                  0
traffic                      0
drivingStyle                 0
dtype: int64

Number of null values in peugeot_01 :

Unnamed: 0                   0
AltitudeVariation            0
VehicleSpeedInstantaneous    3
VehicleSpeedAverage          0
VehicleSpeedVariance         0
VehicleSpeedVariation        0
LongitudinalAcceleration     0
EngineLoad                   5
EngineCoolantTemperature     5
ManifoldAbsolutePressure     5
EngineRPM                    5
MassAirFlow                  5
IntakeAirTemperature         5
VerticalAcceleration         0
FuelConsumptionAverage       5
roadSurface                  0
traffic                      0
drivingStyle                 0
dtype: int64

Number of null values in peugeot_02 :

Unnamed: 0                   0
AltitudeVariation            0
VehicleSpeedInstantaneous    6
VehicleSpeedAverage          0
VehicleSpeedVariance         0
VehicleSpeedVariation        0
LongitudinalAcceleration     0
EngineLoad                   0
EngineCoolantTemperature     0
ManifoldAbsolutePressure     0
EngineRPM                    0
MassAirFlow                  0
IntakeAirTemperature         0
VerticalAcceleration         0
FuelConsumptionAverage       0
roadSurface                  0
traffic                      0
drivingStyle                 0
dtype: int64
There are only a few missing values, all in the Peugeot datasets; we can deal with these at the preprocessing stage.
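One possible way to handle these at the preprocessing stage is median imputation. This is only a sketch under that assumption (the notes don't fix a method yet); the frames and values below are made up, and the names train, test and medians are placeholders. The fill values are learned from the training split only so that nothing leaks from the test split.

```python
import numpy as np
import pandas as pd

# Toy frames with made-up values; column names borrowed from the trip data.
train = pd.DataFrame({
    "VehicleSpeedInstantaneous": [10.0, np.nan, 30.0, 50.0],
    "EngineLoad": [20.0, 40.0, np.nan, 60.0],
})
test = pd.DataFrame({
    "VehicleSpeedInstantaneous": [np.nan, 25.0],
    "EngineLoad": [35.0, np.nan],
})

# Learn the fill values from the training split only, then reuse them on the test split.
medians = train.median(numeric_only=True)
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)

print(medians)
print(test_filled)
```

With so few missing rows, simply dropping them would also be a reasonable option.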
Let's look at correlation between the numerical variables in each dataset.
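def corr_heatmap(df, width=10, height=8):
    #keep numeric columns only; newer pandas versions raise an error on the text columns
    corr_matrix = df.select_dtypes(include='number').corr()
    plt.figure(figsize=(width,height))
    sns.heatmap(corr_matrix, annot=True)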
corr_heatmap(df1)
corr_heatmap(df2)
corr_heatmap(df3)
corr_heatmap(df4)
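Reading strong pairs off four annotated heatmaps by eye is easy to get wrong, so a small helper that lists them can serve as a cross-check. This is a sketch, not part of the original analysis: strong_pairs and its 0.8 threshold are hypothetical choices, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    strong = pairs[pairs.abs() >= threshold]
    # Order by strength, strongest first.
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)

# Synthetic demo data (NOT the trip data).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),  # nearly a linear function of a
    "c": rng.normal(size=500),                      # independent noise
})
result = strong_pairs(demo)
print(result)
```

Calling strong_pairs(df) on each of the four trip frames would give one short list per dataset to compare.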
Some initial observations:
VehicleSpeedInstantaneous and VehicleSpeedAverage are strongly correlated, although this relationship is not as strong for the peugeot_02 dataset.
ManifoldAbsolutePressure, EngineRPM and MassAirFlow are strongly correlated with each other and with VehicleSpeedInstantaneous and VehicleSpeedAverage in all datasets.
FuelConsumptionAverage is negatively correlated with VehicleSpeedAverage in all datasets, and it also has a significant negative correlation with EngineCoolantTemperature and IntakeAirTemperature in the Peugeot datasets.
The relationship between LongitudinalAcceleration and VerticalAcceleration differs between datasets: they are extremely positively correlated in opel_02, unlike in opel_01 and peugeot_02.
Now let's have a look at the distribution of the numerical variables in each dataset.
df1.hist(bins=50, figsize=(20,15))
[Figure: 4x4 grid of histograms for df1, one panel per numeric column from Unnamed: 0 to FuelConsumptionAverage]
df2.hist(bins=50, figsize=(20,15))
[Figure: 4x4 grid of histograms for df2, one panel per numeric column from Unnamed: 0 to FuelConsumptionAverage]
df3.hist(bins=50, figsize=(20,15))
[Figure: 4x4 grid of histograms for df3, one panel per numeric column from Unnamed: 0 to FuelConsumptionAverage]
df4.hist(bins=50, figsize=(20,15))
[Figure: 4x4 grid of histograms for df4, one panel per numeric column from Unnamed: 0 to FuelConsumptionAverage]
For each dataset, the graphs for VehicleSpeedInstantaneous, VehicleSpeedVariance, ManifoldAbsolutePressure, EngineRPM and MassAirFlow are strongly right-skewed, while EngineCoolantTemperature has a long left tail in each of the datasets. We may need to apply some kind of transformation to these variables to try to make their distributions closer to normal.
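To put a number on the skew and see what a transformation does to it, something like the following could be used. It is a sketch on synthetic data, and log1p is just one candidate transformation, not necessarily the one used in the pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (NOT the trip data); values are non-negative.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000), name="right_skewed")

skew_before = sample.skew()
# log1p computes log(1 + x), so zeros are handled; it requires x > -1.
transformed = np.log1p(sample)
skew_after = transformed.skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A log transform only compresses a right tail, so a left-tailed column such as EngineCoolantTemperature would need a different treatment.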
We can also see there is a clear difference in the shape of the distribution for VerticalAcceleration and LongitudinalAcceleration between the opel_02 dataset and the others. Let's take a closer look.
figure, axis = plt.subplots(2, 2, figsize=(15,10))
column = 'VerticalAcceleration'
#histplot with a KDE overlay replaces the deprecated distplot
for ax, df in zip(axis.flat, df_list):
    sns.histplot(df[column], kde=True, stat='density', ax=ax)
    ax.set_title(df.dataframeName)
plt.setp(axis, ylim=(0,1.4))
[0.0, 1.4, 0.0, 1.4, 0.0, 1.4, 0.0, 1.4]
column = 'VerticalAcceleration'
plt.figure(figsize=(14,8))
#overlay all four trips on one set of axes
for df in df_list:
    sns.histplot(df[column], kde=True, stat='density', label=df.dataframeName)
plt.legend()
column = 'VerticalAcceleration'
plt.figure(figsize=(14,8))
#x and y are passed as keywords, as newer seaborn versions require
for df in df_list:
    sns.lineplot(x=df['Unnamed: 0'], y=df[column], label=df.dataframeName, alpha=0.6)
figure, axis = plt.subplots(2, 2, figsize=(15,10))
column = 'LongitudinalAcceleration'
#histplot with a KDE overlay replaces the deprecated distplot
for ax, df in zip(axis.flat, df_list):
    sns.histplot(df[column], kde=True, stat='density', ax=ax)
    ax.set_title(df.dataframeName)
plt.setp(axis, ylim=(0,0.9))
[0.0, 0.9, 0.0, 0.9, 0.0, 0.9, 0.0, 0.9]
plt.figure(figsize=(14,8))
#overlay all four trips on one set of axes
for df in df_list:
    sns.histplot(df[column], kde=True, stat='density', label=df.dataframeName)
plt.legend()
df2.describe()
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 | 4092.000000 |
| mean | 2198.175953 | -0.139590 | 43.482246 | 43.419217 | 162.980180 | -0.011556 | 1.603635 | 34.509613 | 73.378299 | 123.764907 | 1656.040811 | 18.608326 | 18.521261 | 4.061078 | 17.400087 |
| std | 1248.536624 | 2.484872 | 37.543881 | 35.323457 | 162.805408 | 2.227681 | 3.057510 | 23.559170 | 12.875391 | 30.908416 | 575.566227 | 12.334384 | 3.408300 | 7.221253 | 5.970610 |
| min | 44.000000 | -8.299988 | 0.000000 | 0.000000 | 0.000000 | -11.934539 | -1.710800 | 0.000000 | 40.000000 | 98.000000 | 760.000000 | 4.270000 | 10.000000 | -1.140000 | 7.929113 |
| 25% | 1125.750000 | -2.800003 | 11.259000 | 14.868447 | 42.682230 | -0.900719 | -0.254950 | 20.000000 | 64.000000 | 101.000000 | 1087.000000 | 8.445000 | 16.000000 | -0.007800 | 11.634232 |
| 50% | 2188.500000 | 0.000000 | 34.452538 | 26.422996 | 106.038210 | 0.000000 | 0.125000 | 30.588236 | 80.000000 | 110.000000 | 1762.500000 | 15.820000 | 18.000000 | 0.223000 | 18.079138 |
| 75% | 3285.250000 | 1.024994 | 70.312449 | 70.051620 | 240.313896 | 0.900722 | 1.060000 | 47.843140 | 83.000000 | 136.000000 | 2156.000000 | 25.330000 | 20.000000 | 0.716000 | 21.760530 |
| max | 4327.000000 | 10.700012 | 122.723091 | 114.706688 | 956.695096 | 11.259000 | 8.477800 | 100.000000 | 89.000000 | 250.000000 | 3167.000000 | 67.309998 | 30.000000 | 17.944800 | 45.336861 |
plt.figure(figsize=(14,8))
for df in df_list:
    # x and y are passed by keyword; seaborn 0.12+ no longer accepts them positionally
    sns.lineplot(x=df['Unnamed: 0'], y=df[column], label=df.dataframeName, alpha=0.6)
We can see two distinct peaks in both variables at the same points in the opel_02 journey (possibly caused by the smartphone accelerometer malfunctioning at those times). These higher values in the opel_02 dataset could show up as outliers in the combined dataset, so we should keep an eye on these variables later.
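def get_class_proportions(df, category):
    """Print the count and percentage of each class in df[category]."""
    print("\n", df.dataframeName, ":")
    # iterate over value_counts() so each label stays paired with its own count
    # (unique() is ordered by first appearance, value_counts() by frequency)
    for label, count in df[category].value_counts().items():
        print(label)
        print(count, "-", round(count / df.shape[0] * 100, 2), "%")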
category = 'drivingStyle'
for df in df_list:
    get_class_proportions(df, category)
opel_01. : EvenPaceStyle 5751 - 81.71 % AggressiveStyle 1287 - 18.29 %
opel_02 : EvenPaceStyle 3290 - 80.4 % AggressiveStyle 802 - 19.6 %
peugeot_01 : EvenPaceStyle 7716 - 94.11 % AggressiveStyle 483 - 5.89 %
peugeot_02 : EvenPaceStyle 4259 - 95.79 % AggressiveStyle 187 - 4.21 %
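The same summary can also be produced as a small table. The sketch below runs on toy data (the helper name `class_proportions` and the rows are made up for illustration) and takes both the labels and the counts from `value_counts()`, so the two always stay aligned:

```python
import pandas as pd

def class_proportions(df, category):
    """Return counts and percentages per class, labels aligned with counts."""
    counts = df[category].value_counts()           # sorted by frequency, labelled
    percent = (counts / len(df) * 100).round(2)    # same index as counts
    return pd.DataFrame({"count": counts, "percent": percent})

# toy data, not taken from the trips
toy = pd.DataFrame({"drivingStyle": ["EvenPaceStyle"] * 4 + ["AggressiveStyle"]})
summary = class_proportions(toy, "drivingStyle")
print(summary)
```

Returning a DataFrame rather than printing makes it easy to compare the trips side by side later if needed.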
figure, axis = plt.subplots(2, 2, figsize=(12,10))
palette_dict = dict(EvenPaceStyle="g", AggressiveStyle="r")
# one count plot per trip; x is passed by keyword, as seaborn 0.12+ requires
sns.countplot(x=df1[category], palette=palette_dict, ax=axis[0, 0])
axis[0, 0].set_title(df1.dataframeName)
sns.countplot(x=df2[category], palette=palette_dict, ax=axis[0, 1])
axis[0, 1].set_title(df2.dataframeName)
sns.countplot(x=df3[category], palette=palette_dict, ax=axis[1, 0])
axis[1, 0].set_title(df3.dataframeName)
sns.countplot(x=df4[category], palette=palette_dict, ax=axis[1, 1])
axis[1, 1].set_title(df4.dataframeName)
Text(0.5, 1.0, 'peugeot_02')
In general, even-pace driving dominates in every dataset, although the share of aggressive driving is roughly 12 to 15 percentage points lower in the Peugeot datasets (about 4 to 6%) than in the Opel datasets (about 18 to 20%).
Now let's look at the roadSurface.
category = 'roadSurface'
for df in df_list:
    get_class_proportions(df, category)
opel_01. : SmoothCondition 6873 - 97.66 % UnevenCondition 165 - 2.34 %
opel_02 : SmoothCondition 3812 - 93.16 % UnevenCondition 280 - 6.84 %
peugeot_01 : SmoothCondition 3274 - 39.93 % UnevenCondition 3042 - 37.1 % FullOfHolesCondition 1883 - 22.97 %
peugeot_02 : UnevenCondition 2802 - 63.02 % FullOfHolesCondition 1366 - 30.72 % SmoothCondition 278 - 6.25 %
figure, axis = plt.subplots(2, 2, figsize=(12,10))
palette_dict = dict(SmoothCondition="g", UnevenCondition="b", FullOfHolesCondition="r")
# one count plot per trip; x is passed by keyword, as seaborn 0.12+ requires
sns.countplot(x=df1[category], palette=palette_dict, ax=axis[0, 0])
axis[0, 0].set_title(df1.dataframeName)
sns.countplot(x=df2[category], palette=palette_dict, ax=axis[0, 1])
axis[0, 1].set_title(df2.dataframeName)
sns.countplot(x=df3[category], palette=palette_dict, ax=axis[1, 0])
axis[1, 0].set_title(df3.dataframeName)
sns.countplot(x=df4[category], palette=palette_dict, ax=axis[1, 1])
axis[1, 1].set_title(df4.dataframeName)
Text(0.5, 1.0, 'peugeot_02')
From the statistics and charts above we can see a big imbalance between the datasets for this target class. While the Opel trips were driven mostly on smooth roads, the two Peugeot trips were on considerably worse road surfaces, especially the second Peugeot trip, where only 6% of the roads were in a smooth condition. We would need to account for this imbalance when splitting the final dataset into train and test sets if we choose this as our target class for the model.
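One way to see this imbalance in a single table is a row-normalised cross-tabulation of trip against road surface. This is only a sketch on toy data; the `trip` column is hypothetical (the notebook keeps each trip in its own DataFrame instead):

```python
import pandas as pd

# toy stand-in for the four trips stacked together; values are made up
toy = pd.DataFrame({
    "trip": ["opel_01"] * 4 + ["peugeot_02"] * 4,
    "roadSurface": ["SmoothCondition"] * 4
                   + ["UnevenCondition"] * 3 + ["SmoothCondition"],
})

# normalize="index" makes every row sum to 1, i.e. class shares per trip
share = pd.crosstab(toy["trip"], toy["roadSurface"], normalize="index")
print(share.round(2))
```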
The last category class we will compare is traffic.
traffic_conditions = {'LowCongestionCondition': 'Low', 'NormalCongestionCondition': 'Medium',
                      'HighCongestionCondition': 'High'}
for df in df_list:
    df['traffic'] = df['traffic'].map(traffic_conditions)  # shorten the class labels
category = 'traffic'
for df in df_list:
    get_class_proportions(df, category)
opel_01. : Low 6461 - 91.8 % High 449 - 6.38 % Medium 128 - 1.82 %
opel_02 : Low 3591 - 87.76 % Medium 405 - 9.9 % High 96 - 2.35 %
peugeot_01 : Low 6844 - 83.47 % Medium 696 - 8.49 % High 659 - 8.04 %
peugeot_02 : High 1813 - 40.78 % Medium 1765 - 39.7 % Low 868 - 19.52 %
figure, axis = plt.subplots(2, 2, figsize=(12,10))
palette_dict = dict(Low="g", Medium="b", High="r")
# one count plot per trip; x is passed by keyword, as seaborn 0.12+ requires
sns.countplot(x=df1[category], palette=palette_dict, ax=axis[0, 0])
axis[0, 0].set_title(df1.dataframeName)
sns.countplot(x=df2[category], palette=palette_dict, ax=axis[0, 1])
axis[0, 1].set_title(df2.dataframeName)
sns.countplot(x=df3[category], palette=palette_dict, ax=axis[1, 0])
axis[1, 0].set_title(df3.dataframeName)
sns.countplot(x=df4[category], palette=palette_dict, ax=axis[1, 1])
axis[1, 1].set_title(df4.dataframeName)
Text(0.5, 1.0, 'peugeot_02')
While the traffic congestion conditions for the first three datasets are quite similar, there was clearly much more congestion during the last trip. Let's visualize the vehicle speed during the different trips to see if the high traffic for trip 4 is noticeable.
plt.figure(figsize=(14,8))
for df in df_list:
    plt.fill_between(df['Unnamed: 0'], df['VehicleSpeedAverage'], label=df.dataframeName, alpha=0.8)
plt.legend()
<matplotlib.legend.Legend at 0x268e27234c0>
figure, axis = plt.subplots(2, 2, figsize=(15,10))
y_column = 'VehicleSpeedInstantaneous'
x_column = 'VehicleSpeedAverage'
plt.setp(axis, ylim=(0,130), xlim=(0,130))
sns.scatterplot(x=x_column, y=y_column, data=df1, alpha=0.5, hue='traffic',
palette=palette_dict, ax=axis[0,0])
axis[0, 0].set_title(df1.dataframeName)
sns.scatterplot(x=x_column, y=y_column, data=df2, alpha=0.5, hue='traffic',
palette=palette_dict, ax=axis[0,1])
axis[0, 1].set_title(df2.dataframeName)
sns.scatterplot(x=x_column, y=y_column, data=df3, alpha=0.5, hue='traffic',
palette=palette_dict, ax=axis[1,0])
axis[1, 0].set_title(df3.dataframeName)
sns.scatterplot(x=x_column, y=y_column, data=df4, alpha=0.5, hue='traffic',
palette=palette_dict, ax=axis[1,1])
axis[1, 1].set_title(df4.dataframeName)
Text(0.5, 1.0, 'peugeot_02')
From these charts we can see that when there is mainly low traffic congestion the vehicle can go at greater speeds, but during trip 4 there was too much medium to high traffic congestion to allow the vehicle to reach higher speeds.
So we can make an assumption that once the speed goes over a certain threshold, there is a higher probability that there is low traffic.
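This assumption could be checked numerically by comparing the share of low-traffic rows above and below a speed cut-off. The sketch below uses toy data and a made-up threshold of 50; neither comes from the trips:

```python
import pandas as pd

# toy rows, values invented for illustration
toy = pd.DataFrame({
    "VehicleSpeedAverage": [10, 20, 35, 60, 80, 95, 100, 30],
    "traffic": ["High", "Medium", "Low", "Low", "Low", "Low", "Low", "High"],
})

threshold = 50  # hypothetical cut-off, in the same units as the speed column
fast = toy["VehicleSpeedAverage"] > threshold

# the mean of a boolean Series is the fraction of True values
p_low_fast = (toy.loc[fast, "traffic"] == "Low").mean()
p_low_slow = (toy.loc[~fast, "traffic"] == "Low").mean()
print(p_low_fast, p_low_slow)
```

A large gap between the two fractions on the real data would support the idea that speed carries information about congestion.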
Let's see if the same holds for road surface: do higher speeds indicate that the road surface was probably in better condition?
figure, axis = plt.subplots(2, 2, figsize=(15,10))
palette_dict = dict(SmoothCondition="g", UnevenCondition="b", FullOfHolesCondition="r")
y_column = 'VehicleSpeedInstantaneous'
x_column = 'VehicleSpeedAverage'
plt.setp(axis, ylim=(0,130), xlim=(0,130))
sns.scatterplot(x=x_column, y=y_column, data=df1, alpha=0.5, hue='roadSurface',
palette=palette_dict, ax=axis[0,0])
axis[0, 0].set_title(df1.dataframeName)
sns.scatterplot(x=x_column, y=y_column, data=df2, alpha=0.5, hue='roadSurface',
palette=palette_dict, ax=axis[0,1])
axis[0, 1].set_title(df2.dataframeName)
sns.scatterplot(x=x_column, y=y_column, data=df3, alpha=0.5, hue='roadSurface',
palette=palette_dict, ax=axis[1,0])
axis[1, 0].set_title(df3.dataframeName)
sns.scatterplot(x=x_column, y=y_column, data=df4, alpha=0.5, hue='roadSurface',
palette=palette_dict, ax=axis[1,1])
axis[1, 1].set_title(df4.dataframeName)
Text(0.5, 1.0, 'peugeot_02')
In the first two trips, high speeds seem to go hand in hand with smooth roads. In the third trip (peugeot_01), however, speeds were slightly higher on roads full of holes than on uneven roads, and in the fourth trip (peugeot_02) higher speeds were reached on both uneven roads and roads full of holes than on smooth roads. So speed may not be as reliable a predictor of road surface condition as it is of traffic congestion.
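The visual impression could be backed up with a per-trip summary of speed by road surface. A minimal sketch on toy data follows; the `trip` column and all values are hypothetical:

```python
import pandas as pd

# toy rows: trip "a" is faster on smooth roads, trip "b" is the other way round
toy = pd.DataFrame({
    "trip": ["a"] * 4 + ["b"] * 4,
    "roadSurface": ["SmoothCondition", "SmoothCondition",
                    "UnevenCondition", "UnevenCondition"] * 2,
    "VehicleSpeedAverage": [80, 90, 30, 40, 20, 30, 50, 60],
})

# median speed for every (trip, surface) pair, surfaces spread across columns
median_speed = (
    toy.groupby(["trip", "roadSurface"])["VehicleSpeedAverage"]
       .median()
       .unstack()
)
print(median_speed)
```

If the ordering of the columns differs from row to row, as it does in this toy table, speed alone is not a consistent signal of surface condition.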
Now let's use one-hot encoding to turn the categories into numerical values and further explore how they are correlated to the other variables using correlation matrices and heatmaps.
df1_dummies = pd.get_dummies(df1)
df2_dummies = pd.get_dummies(df2)
df3_dummies = pd.get_dummies(df3)
df4_dummies = pd.get_dummies(df4)
corr_heatmap(df1_dummies,15,12)
corr_heatmap(df2_dummies,15,12)
corr_heatmap(df3_dummies,15,12)
corr_heatmap(df4_dummies,15,12)
# combine the four one-hot encoded datasets; a dummy column missing from a dataset becomes NaN
df_dummies = pd.concat([df1_dummies, df2_dummies, df3_dummies, df4_dummies], axis=0)
print(df_dummies.shape)
df_dummies.head()
(23775, 23)
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | ... | VerticalAcceleration | FuelConsumptionAverage | roadSurface_SmoothCondition | roadSurface_UnevenCondition | traffic_High | traffic_Low | traffic_Medium | drivingStyle_AggressiveStyle | drivingStyle_EvenPaceStyle | roadSurface_FullOfHolesCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | -2.299988 | 25.670519 | 13.223501 | 121.592690 | -2.476980 | 0.3555 | 4.705883 | 68.0 | 106.0 | ... | -0.1133 | 19.497335 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | NaN |
| 1 | 60 | -2.099976 | 24.094259 | 13.638919 | 120.422571 | -1.576260 | 0.4492 | 10.588236 | 68.0 | 103.0 | ... | -0.1289 | 19.515722 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | NaN |
| 2 | 61 | -1.500000 | 22.743179 | 14.031043 | 118.456769 | -1.351080 | 0.4258 | 27.450981 | 68.0 | 103.0 | ... | -0.1328 | 19.441765 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | NaN |
| 3 | 62 | 0.100037 | 22.292820 | 14.171073 | 117.571308 | -0.450359 | 0.4140 | 24.313726 | 69.0 | 104.0 | ... | -0.0859 | 19.388769 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | NaN |
| 4 | 63 | 0.099976 | 23.643900 | 14.328954 | 117.074149 | 1.351080 | 0.3945 | 20.000000 | 69.0 | 104.0 | ... | -0.0664 | 19.301638 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | NaN |
5 rows × 23 columns
df_dummies.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23775 entries, 0 to 4445
Data columns (total 23 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Unnamed: 0                        23775 non-null  int64
 1   AltitudeVariation                 23775 non-null  float64
 2   VehicleSpeedInstantaneous         23766 non-null  float64
 3   VehicleSpeedAverage               23775 non-null  float64
 4   VehicleSpeedVariance              23775 non-null  float64
 5   VehicleSpeedVariation             23775 non-null  float64
 6   LongitudinalAcceleration          23775 non-null  float64
 7   EngineLoad                        23770 non-null  float64
 8   EngineCoolantTemperature          23770 non-null  float64
 9   ManifoldAbsolutePressure          23770 non-null  float64
 10  EngineRPM                         23770 non-null  float64
 11  MassAirFlow                       23770 non-null  float64
 12  IntakeAirTemperature              23770 non-null  float64
 13  VerticalAcceleration              23775 non-null  float64
 14  FuelConsumptionAverage            23770 non-null  float64
 15  roadSurface_SmoothCondition       23775 non-null  uint8
 16  roadSurface_UnevenCondition       23775 non-null  uint8
 17  traffic_High                      23775 non-null  uint8
 18  traffic_Low                       23775 non-null  uint8
 19  traffic_Medium                    23775 non-null  uint8
 20  drivingStyle_AggressiveStyle      23775 non-null  uint8
 21  drivingStyle_EvenPaceStyle        23775 non-null  uint8
 22  roadSurface_FullOfHolesCondition  12645 non-null  float64
dtypes: float64(15), int64(1), uint8(7)
memory usage: 3.2 MB
df_dummies = df_dummies.fillna(0) # fills roadSurface_FullOfHolesCondition, which is absent from the two Opel datasets; note that this also sets the few missing sensor readings to 0
df_dummies.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23775 entries, 0 to 4445
Data columns (total 23 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Unnamed: 0                        23775 non-null  int64
 1   AltitudeVariation                 23775 non-null  float64
 2   VehicleSpeedInstantaneous         23775 non-null  float64
 3   VehicleSpeedAverage               23775 non-null  float64
 4   VehicleSpeedVariance              23775 non-null  float64
 5   VehicleSpeedVariation             23775 non-null  float64
 6   LongitudinalAcceleration          23775 non-null  float64
 7   EngineLoad                        23775 non-null  float64
 8   EngineCoolantTemperature          23775 non-null  float64
 9   ManifoldAbsolutePressure          23775 non-null  float64
 10  EngineRPM                         23775 non-null  float64
 11  MassAirFlow                       23775 non-null  float64
 12  IntakeAirTemperature              23775 non-null  float64
 13  VerticalAcceleration              23775 non-null  float64
 14  FuelConsumptionAverage            23775 non-null  float64
 15  roadSurface_SmoothCondition       23775 non-null  uint8
 16  roadSurface_UnevenCondition       23775 non-null  uint8
 17  traffic_High                      23775 non-null  uint8
 18  traffic_Low                       23775 non-null  uint8
 19  traffic_Medium                    23775 non-null  uint8
 20  drivingStyle_AggressiveStyle      23775 non-null  uint8
 21  drivingStyle_EvenPaceStyle        23775 non-null  uint8
 22  roadSurface_FullOfHolesCondition  23775 non-null  float64
dtypes: float64(15), int64(1), uint8(7)
memory usage: 3.2 MB
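An alternative that avoids the missing dummy column altogether is to concatenate the raw datasets first and one-hot encode afterwards, so every category seen in any trip gets a column for all rows. This is a sketch on two made-up frames, not the approach used above:

```python
import pandas as pd

# two toy "trips"; the second has a category the first one lacks
trip_a = pd.DataFrame({
    "speed": [10.0, None],
    "roadSurface": ["SmoothCondition", "UnevenCondition"],
})
trip_b = pd.DataFrame({
    "speed": [30.0, 40.0],
    "roadSurface": ["FullOfHolesCondition", "SmoothCondition"],
})

# concatenate first, then encode: no NaN appears in the dummy columns
joined = pd.concat([trip_a, trip_b], ignore_index=True)
dummies = pd.get_dummies(joined)

print(dummies.columns.tolist())
print(dummies.isna().sum())  # only the genuinely missing speed value remains
```

With this order no blanket `fillna(0)` is needed, so real missing sensor readings stay visible as NaN until they are handled deliberately.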
corr_heatmap(df_dummies, 15, 10) #heatmap for all the datasets combined
Comparing the correlation between each roadSurface class and the rest of the dataset.
corr_matrix = df_dummies.corr()
corr_matrix['roadSurface_SmoothCondition'].sort_values(ascending=False, key=abs)
roadSurface_SmoothCondition         1.000000
roadSurface_UnevenCondition        -0.732700
roadSurface_FullOfHolesCondition   -0.486075
traffic_Low                         0.406635
EngineRPM                           0.405308
EngineCoolantTemperature            0.386893
ManifoldAbsolutePressure            0.363047
VehicleSpeedAverage                 0.362164
VehicleSpeedInstantaneous           0.331442
MassAirFlow                         0.322034
traffic_Medium                     -0.268453
traffic_High                       -0.263392
VerticalAcceleration                0.163430
drivingStyle_AggressiveStyle        0.150004
drivingStyle_EvenPaceStyle         -0.150004
VehicleSpeedVariance                0.104172
Unnamed: 0                          0.084757
IntakeAirTemperature               -0.076922
FuelConsumptionAverage             -0.069356
LongitudinalAcceleration            0.051849
AltitudeVariation                  -0.009995
EngineLoad                         -0.009645
VehicleSpeedVariation              -0.002225
Name: roadSurface_SmoothCondition, dtype: float64
corr_matrix['roadSurface_UnevenCondition'].sort_values(ascending=False, key=abs)
roadSurface_UnevenCondition         1.000000
roadSurface_SmoothCondition        -0.732700
traffic_Low                        -0.461833
traffic_Medium                      0.387456
EngineRPM                          -0.315107
VehicleSpeedAverage                -0.294049
MassAirFlow                        -0.277078
ManifoldAbsolutePressure           -0.274877
VehicleSpeedInstantaneous          -0.271829
roadSurface_FullOfHolesCondition   -0.238599
IntakeAirTemperature                0.224771
traffic_High                        0.216853
VerticalAcceleration               -0.107318
drivingStyle_AggressiveStyle       -0.103855
drivingStyle_EvenPaceStyle          0.103855
VehicleSpeedVariance               -0.084312
Unnamed: 0                         -0.063293
EngineLoad                         -0.032652
FuelConsumptionAverage              0.031547
LongitudinalAcceleration           -0.012922
EngineCoolantTemperature           -0.011208
VehicleSpeedVariation              -0.004375
AltitudeVariation                  -0.002259
Name: roadSurface_UnevenCondition, dtype: float64
corr_matrix['roadSurface_FullOfHolesCondition'].sort_values(ascending=False, key=abs)
roadSurface_FullOfHolesCondition    1.000000
EngineCoolantTemperature           -0.537687
roadSurface_SmoothCondition        -0.486075
roadSurface_UnevenCondition        -0.238599
IntakeAirTemperature               -0.178871
EngineRPM                          -0.173719
ManifoldAbsolutePressure           -0.165075
VehicleSpeedAverage                -0.139195
VehicleSpeedInstantaneous          -0.123890
traffic_Medium                     -0.114473
MassAirFlow                        -0.103725
traffic_High                        0.097382
VerticalAcceleration               -0.095398
drivingStyle_AggressiveStyle       -0.080686
drivingStyle_EvenPaceStyle          0.080686
FuelConsumptionAverage              0.058457
LongitudinalAcceleration           -0.057393
EngineLoad                          0.055693
VehicleSpeedVariance               -0.040382
Unnamed: 0                         -0.039668
AltitudeVariation                   0.017164
traffic_Low                         0.012802
VehicleSpeedVariation               0.008793
Name: roadSurface_FullOfHolesCondition, dtype: float64
Comparing the correlation between each traffic class and the rest of the dataset.
corr_matrix['traffic_High'].sort_values(ascending=False, key=abs)
traffic_High                        1.000000
traffic_Low                        -0.655378
EngineRPM                          -0.334759
VehicleSpeedAverage                -0.325359
VehicleSpeedInstantaneous          -0.307442
MassAirFlow                        -0.298235
roadSurface_SmoothCondition        -0.263392
FuelConsumptionAverage              0.251340
ManifoldAbsolutePressure           -0.222664
roadSurface_UnevenCondition         0.216853
Unnamed: 0                         -0.202326
VehicleSpeedVariance               -0.155008
traffic_Medium                     -0.144706
EngineCoolantTemperature           -0.130572
roadSurface_FullOfHolesCondition    0.097382
LongitudinalAcceleration           -0.073743
drivingStyle_AggressiveStyle       -0.066326
drivingStyle_EvenPaceStyle          0.066326
AltitudeVariation                   0.024357
EngineLoad                         -0.020093
VerticalAcceleration               -0.016109
IntakeAirTemperature               -0.013768
VehicleSpeedVariation              -0.002360
Name: traffic_High, dtype: float64
corr_matrix['traffic_Medium'].sort_values(ascending=False, key=abs)
traffic_Medium                      1.000000
traffic_Low                        -0.652514
roadSurface_UnevenCondition         0.387456
roadSurface_SmoothCondition        -0.268453
EngineCoolantTemperature           -0.147419
traffic_High                       -0.144706
ManifoldAbsolutePressure           -0.135753
EngineRPM                          -0.134645
MassAirFlow                        -0.115996
roadSurface_FullOfHolesCondition   -0.114473
VehicleSpeedAverage                -0.107162
VehicleSpeedInstantaneous          -0.100988
Unnamed: 0                         -0.100546
VehicleSpeedVariance                0.065516
FuelConsumptionAverage              0.045888
AltitudeVariation                   0.037395
drivingStyle_AggressiveStyle       -0.031446
drivingStyle_EvenPaceStyle          0.031446
EngineLoad                          0.029695
LongitudinalAcceleration           -0.027627
VerticalAcceleration               -0.020862
IntakeAirTemperature                0.009993
VehicleSpeedVariation               0.002886
Name: traffic_Medium, dtype: float64
corr_matrix['traffic_Low'].sort_values(ascending=False, key=abs)
traffic_Low                         1.000000
traffic_High                       -0.655378
traffic_Medium                     -0.652514
roadSurface_UnevenCondition        -0.461833
roadSurface_SmoothCondition         0.406635
EngineRPM                           0.359151
VehicleSpeedAverage                 0.330973
MassAirFlow                         0.316943
VehicleSpeedInstantaneous           0.312539
ManifoldAbsolutePressure            0.274149
Unnamed: 0                          0.231699
FuelConsumptionAverage             -0.227514
EngineCoolantTemperature            0.212527
LongitudinalAcceleration            0.077563
drivingStyle_AggressiveStyle        0.074799
drivingStyle_EvenPaceStyle         -0.074799
VehicleSpeedVariance                0.068700
AltitudeVariation                  -0.047199
VerticalAcceleration                0.028261
roadSurface_FullOfHolesCondition    0.012802
EngineLoad                         -0.007279
IntakeAirTemperature                0.002916
VehicleSpeedVariation              -0.000395
Name: traffic_Low, dtype: float64
# joining the datasets again without dummies
df_joined = pd.concat([df1, df2, df3, df4], axis=0)
print(df_joined.shape)
df_joined.head()
(23775, 18)
| | Unnamed: 0 | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | roadSurface | traffic | drivingStyle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | -2.299988 | 25.670519 | 13.223501 | 121.592690 | -2.476980 | 0.3555 | 4.705883 | 68.0 | 106.0 | 1796.0 | 15.81 | 24.0 | -0.1133 | 19.497335 | SmoothCondition | Low | EvenPaceStyle |
| 1 | 60 | -2.099976 | 24.094259 | 13.638919 | 120.422571 | -1.576260 | 0.4492 | 10.588236 | 68.0 | 103.0 | 1689.0 | 14.65 | 22.0 | -0.1289 | 19.515722 | SmoothCondition | Low | EvenPaceStyle |
| 2 | 61 | -1.500000 | 22.743179 | 14.031043 | 118.456769 | -1.351080 | 0.4258 | 27.450981 | 68.0 | 103.0 | 1599.0 | 11.85 | 21.0 | -0.1328 | 19.441765 | SmoothCondition | Low | EvenPaceStyle |
| 3 | 62 | 0.100037 | 22.292820 | 14.171073 | 117.571308 | -0.450359 | 0.4140 | 24.313726 | 69.0 | 104.0 | 1620.0 | 12.21 | 20.0 | -0.0859 | 19.388769 | SmoothCondition | Low | EvenPaceStyle |
| 4 | 63 | 0.099976 | 23.643900 | 14.328954 | 117.074149 | 1.351080 | 0.3945 | 20.000000 | 69.0 | 104.0 | 1708.0 | 11.91 | 21.0 | -0.0664 | 19.301638 | SmoothCondition | Low | EvenPaceStyle |
df_joined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23775 entries, 0 to 4445
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Unnamed: 0                 23775 non-null  int64
 1   AltitudeVariation          23775 non-null  float64
 2   VehicleSpeedInstantaneous  23766 non-null  float64
 3   VehicleSpeedAverage        23775 non-null  float64
 4   VehicleSpeedVariance       23775 non-null  float64
 5   VehicleSpeedVariation      23775 non-null  float64
 6   LongitudinalAcceleration   23775 non-null  float64
 7   EngineLoad                 23770 non-null  float64
 8   EngineCoolantTemperature   23770 non-null  float64
 9   ManifoldAbsolutePressure   23770 non-null  float64
 10  EngineRPM                  23770 non-null  float64
 11  MassAirFlow                23770 non-null  float64
 12  IntakeAirTemperature       23770 non-null  float64
 13  VerticalAcceleration       23775 non-null  float64
 14  FuelConsumptionAverage     23770 non-null  float64
 15  roadSurface                23775 non-null  object
 16  traffic                    23775 non-null  object
 17  drivingStyle               23775 non-null  object
dtypes: float64(14), int64(1), object(3)
memory usage: 3.4+ MB
df_joined['traffic'].value_counts()
Low       17764
High       3017
Medium     2994
Name: traffic, dtype: int64
df_joined['roadSurface'].value_counts()
SmoothCondition         14237
UnevenCondition          6289
FullOfHolesCondition     3249
Name: roadSurface, dtype: int64
The classes for roadSurface seem to be slightly more balanced than the distribution of classes for traffic, so let's choose roadSurface as our target.
Also, the minority class for roadSurface ('FullOfHolesCondition') might be a bit easier to predict than traffic's smallest class ('Medium'), as it's more correlated with other variables.
Because the roadSurface classes are imbalanced, we will need to use a stratified train/test split to ensure that the train and test targets have a similar distribution of classes.
from sklearn.model_selection import train_test_split
# Build feature/target arrays
X = df_joined.drop(['roadSurface', 'Unnamed: 0'], axis=1)
y = df_joined['roadSurface']
# Create train/test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=101, test_size=0.2, stratify=y
) #using 'stratify' to make sure the target class in the test set has the same distribution as the training set
y_train.value_counts() #checking for similar distribution of target class
SmoothCondition         11390
UnevenCondition          5031
FullOfHolesCondition     2599
Name: roadSurface, dtype: int64
y_test.value_counts() #checking for similar distribution of target class
SmoothCondition         2847
UnevenCondition         1258
FullOfHolesCondition     650
Name: roadSurface, dtype: int64
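As a quick arithmetic check, the two sets of counts printed above can be turned into percentages to confirm that the split is stratified. The helper name `shares` is just for illustration:

```python
# counts copied from the y_train and y_test outputs above
train = {"SmoothCondition": 11390, "UnevenCondition": 5031, "FullOfHolesCondition": 2599}
test = {"SmoothCondition": 2847, "UnevenCondition": 1258, "FullOfHolesCondition": 650}

def shares(counts):
    """Convert raw class counts to percentages of the total."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

train_share = shares(train)
test_share = shares(test)
print(train_share)  # about 59.88 / 26.45 / 13.66
print(test_share)   # about 59.87 / 26.46 / 13.67
```

The class shares agree to within a few hundredths of a percentage point, which is what `stratify=y` is meant to achieve.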
X_train.isna().sum() #checking for missing values - we can deal with these in preprocessing pipeline
AltitudeVariation            0
VehicleSpeedInstantaneous    7
VehicleSpeedAverage          0
VehicleSpeedVariance         0
VehicleSpeedVariation        0
LongitudinalAcceleration     0
EngineLoad                   5
EngineCoolantTemperature     5
ManifoldAbsolutePressure     5
EngineRPM                    5
MassAirFlow                  5
IntakeAirTemperature         5
VerticalAcceleration         0
FuelConsumptionAverage       5
traffic                      0
drivingStyle                 0
dtype: int64
X_train[X_train.isna().any(axis=1)]
| | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage | traffic | drivingStyle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8134 | -0.300003 | NaN | 15.315254 | 80.259585 | 0.000000 | 0.8187 | NaN | NaN | NaN | NaN | NaN | NaN | -0.7979 | NaN | Low | EvenPaceStyle |
| 2669 | -0.099998 | NaN | 18.618836 | 146.685124 | -5.400000 | -0.0969 | 40.392159 | 48.0 | 102.0 | 788.5 | 5.520000 | 15.0 | -0.2312 | 17.308382 | High | EvenPaceStyle |
| 8136 | 0.000000 | 0.0 | 14.522033 | 85.453810 | 0.000000 | 0.9423 | NaN | NaN | NaN | NaN | NaN | NaN | -0.9309 | NaN | Low | EvenPaceStyle |
| 1299 | 1.200005 | NaN | 106.764404 | 7.678189 | -103.500000 | 1.6302 | 29.803923 | 79.0 | 121.0 | 2377.0 | 32.380001 | 43.0 | -1.6623 | 10.790896 | Low | EvenPaceStyle |
| 8137 | -0.400002 | 0.0 | 14.140677 | 87.844517 | 0.000000 | 0.9089 | NaN | NaN | NaN | NaN | NaN | NaN | -0.9132 | NaN | Low | EvenPaceStyle |
| 4095 | 1.600002 | NaN | 42.733725 | 82.404679 | -31.072817 | 0.8820 | 19.215687 | 61.0 | 105.0 | 1251.5 | 15.740000 | 20.0 | -0.1980 | 16.866726 | Medium | EvenPaceStyle |
| 5671 | 0.199997 | 0.0 | 13.789830 | 90.425060 | 0.000000 | 0.8626 | NaN | NaN | NaN | NaN | NaN | NaN | -0.9058 | NaN | Low | EvenPaceStyle |
| 4096 | 0.900002 | NaN | 42.411242 | 77.606988 | 0.000000 | 0.8603 | 0.000000 | 61.0 | 104.0 | 1185.0 | 15.720000 | 20.0 | -0.1512 | 16.808987 | Medium | EvenPaceStyle |
| 2670 | 0.000000 | NaN | 18.480320 | 148.106673 | 0.000000 | -0.1330 | 39.607845 | 48.0 | 102.0 | 787.5 | 5.440000 | 15.0 | -0.1316 | 17.308382 | High | EvenPaceStyle |
| 8135 | 0.099998 | 0.0 | 14.918644 | 83.016709 | 0.000000 | 0.9051 | NaN | NaN | NaN | NaN | NaN | NaN | -0.8546 | NaN | Low | EvenPaceStyle |
| 1300 | 2.300003 | NaN | 106.867239 | 7.178025 | 0.000000 | 1.6918 | 0.000000 | 79.0 | 123.0 | 2328.0 | 31.160000 | 42.0 | -1.7648 | 10.794626 | Low | EvenPaceStyle |
There are only a few missing values, so they are most likely missing at random, although some of them fall in successive rows from one or two of the datasets. We could simply drop the rows with missing values since there are so few, but in this case let's fill them with median values in the preprocessing pipeline (the median is probably more suitable than the mean if the data is skewed).
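To illustrate the idea of median filling, here is a minimal pandas sketch on toy data. It is not the pipeline step itself (scikit-learn's `SimpleImputer(strategy="median")` would play that role); the point it shows is that the medians are learned from the training rows only and then reused for the test rows:

```python
import numpy as np
import pandas as pd

# toy frames, values invented for illustration
train = pd.DataFrame({"EngineRPM": [800.0, 1500.0, np.nan, 3000.0]})
test = pd.DataFrame({"EngineRPM": [np.nan, 2000.0]})

# per-column medians computed on the training data only (NaN is skipped)
medians = train.median(numeric_only=True)

train_filled = train.fillna(medians)
test_filled = test.fillna(medians)  # same medians, so no information leaks from test
print(train_filled)
print(test_filled)
```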
Let's look again at the distributions of the numerical values to check for skewness, and then let's try out different transformations and compare results to find the most suitable transformation for different columns.
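As a rough sketch of what comparing transformations could look like, the snippet below measures skewness before and after a `log1p` transform on a made-up right-skewed series. Note that `log1p` only suits non-negative columns; columns with negative values (such as AltitudeVariation) would need something else, for example a Yeo-Johnson power transform.

```python
import numpy as np
import pandas as pd

# toy right-skewed values, not taken from the dataset
values = pd.Series([0, 1, 1, 2, 2, 3, 5, 8, 20, 100], dtype=float)

skew_raw = values.skew()            # sample skewness of the raw values
skew_log = np.log1p(values).skew()  # skewness after log(1 + x)

print("raw:", round(skew_raw, 2), "log1p:", round(skew_log, 2))
```

A skewness closer to zero after the transform suggests it is a reasonable candidate for that column.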
X_train.describe()
| | AltitudeVariation | VehicleSpeedInstantaneous | VehicleSpeedAverage | VehicleSpeedVariance | VehicleSpeedVariation | LongitudinalAcceleration | EngineLoad | EngineCoolantTemperature | ManifoldAbsolutePressure | EngineRPM | MassAirFlow | IntakeAirTemperature | VerticalAcceleration | FuelConsumptionAverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19020.000000 | 19013.000000 | 19020.000000 | 19020.000000 | 19020.000000 | 19020.000000 | 19015.000000 | 19015.000000 | 19015.000000 | 19015.000000 | 19015.000000 | 19015.000000 | 19020.000000 | 19015.000000 |
| mean | -0.275126 | 38.495054 | 38.611653 | 173.028641 | -0.031052 | 0.666013 | 36.767609 | 70.728004 | 115.164554 | 1485.650171 | 16.037813 | 23.419879 | 0.439920 | 15.185336 |
| std | 2.105322 | 33.660962 | 31.001497 | 183.095329 | 2.526903 | 1.566940 | 26.673599 | 15.737972 | 20.357926 | 576.620778 | 9.665254 | 10.850407 | 3.402525 | 4.593061 |
| min | -24.600006 | 0.000000 | 0.000000 | 0.000000 | -103.500000 | -3.065000 | 0.000000 | 12.000000 | 88.000000 | 0.000000 | 0.880000 | 7.000000 | -2.752900 | 7.271883 |
| 25% | -1.400002 | 11.700000 | 15.983088 | 46.551983 | -0.900002 | -0.223000 | 18.431374 | 63.000000 | 102.000000 | 854.500000 | 6.970000 | 16.000000 | -0.548800 | 11.697115 |
| 50% | -0.099998 | 31.500000 | 29.171657 | 115.317339 | 0.000000 | 0.382750 | 31.372549 | 79.000000 | 107.000000 | 1485.000000 | 15.380000 | 19.000000 | -0.077800 | 14.533261 |
| 75% | 0.700001 | 54.493561 | 49.829998 | 231.517405 | 0.900002 | 1.145225 | 58.431374 | 80.000000 | 120.000000 | 1951.000000 | 21.660000 | 31.000000 | 0.234000 | 18.297857 |
| max | 11.400002 | 124.749725 | 121.330733 | 1418.370369 | 97.199997 | 8.477800 | 100.000000 | 89.000000 | 252.000000 | 3167.000000 | 73.250000 | 65.000000 | 17.944800 | 45.336861 |
X_train_num = X_train.drop(['traffic', 'drivingStyle'], axis=1) #dropping category columns
# first checking for outliers
import scipy.stats as st
for column in X_train_num:
    # z-scores for this column (zscore returns all NaN if the column contains any NaN)
    z = np.abs(st.zscore(X_train_num[column]))
    print(column)
    print(np.where(z > 3))  # positions of the rows with |z-score| > 3
AltitudeVariation
(array([ 88, 165, 247, 281, 314, 355, 531, 599, 931,
981, 1394, 1400, 1807, 1869, 1894, 1900, 1980, 1996,
2100, 2198, 2346, 2363, 2380, 2489, 2576, 2668, 2681,
2732, 2788, 2815, 2859, 2989, 3065, 3214, 3279, 3407,
3409, 3513, 3539, 3567, 3886, 3908, 3924, 4067, 4090,
4151, 4161, 4166, 4180, 4202, 4345, 4411, 4489, 4606,
4651, 4700, 4864, 4887, 4920, 5042, 5124, 5173, 5299,
5308, 5434, 5445, 5474, 5635, 5675, 5774, 5806, 5864,
5871, 5937, 6014, 6016, 6021, 6063, 6112, 6235, 6300,
6412, 6415, 6426, 6478, 6481, 6552, 6646, 6710, 6748,
6897, 7025, 7094, 7147, 7184, 7202, 7263, 7338, 7362,
7549, 7828, 7948, 7961, 8002, 8068, 8084, 8197, 8211,
8330, 8362, 8391, 8503, 8936, 8961, 8976, 9053, 9070,
9092, 9132, 9176, 9198, 9257, 9323, 9328, 9339, 9597,
9663, 9671, 9682, 9685, 9715, 9859, 9920, 9991, 10015,
10078, 10107, 10147, 10156, 10161, 10239, 10259, 10398, 10489,
10512, 10523, 10578, 10618, 10693, 10788, 11084, 11095, 11103,
11105, 11147, 11280, 11312, 11324, 11403, 11459, 11667, 11727,
11785, 11787, 11792, 11893, 12056, 12291, 12355, 12452, 12662,
12677, 12766, 12815, 12947, 13013, 13027, 13082, 13184, 13229,
13252, 13306, 13437, 13483, 13505, 13753, 13756, 13881, 13900,
13941, 13977, 14205, 14232, 14267, 14305, 14306, 14343, 14352,
14513, 14519, 14542, 14596, 14700, 14776, 14835, 14865, 14873,
14880, 14901, 14902, 14923, 14954, 15056, 15138, 15238, 15291,
15368, 15410, 15433, 15591, 15607, 15682, 15932, 15991, 16141,
16173, 16190, 16194, 16225, 16270, 16281, 16308, 16397, 16417,
16418, 16466, 16498, 16609, 16610, 16643, 16678, 16698, 16745,
16900, 16974, 17044, 17070, 17176, 17220, 17326, 17379, 17406,
17435, 17576, 17672, 17681, 17721, 17749, 17765, 17888, 17971,
18108, 18161, 18167, 18171, 18188, 18288, 18338, 18355, 18455,
18622, 18714, 18721, 18732, 18810, 18812, 18822, 18849, 18946,
18983], dtype=int64),)
VehicleSpeedInstantaneous
(array([], dtype=int64),)
VehicleSpeedAverage
(array([], dtype=int64),)
VehicleSpeedVariance
(array([ 50, 117, 122, 141, 276, 281, 344, 348, 353,
380, 435, 438, 505, 525, 561, 600, 604, 621,
650, 704, 758, 765, 771, 883, 901, 952, 957,
979, 981, 1054, 1097, 1120, 1142, 1147, 1172, 1206,
1213, 1221, 1298, 1301, 1307, 1396, 1448, 1460, 1463,
1469, 1539, 1556, 1636, 1716, 1732, 1740, 1750, 1760,
1814, 1954, 1976, 1983, 1994, 2003, 2036, 2076, 2107,
2111, 2114, 2141, 2256, 2270, 2302, 2325, 2380, 2384,
2393, 2427, 2455, 2474, 2579, 2602, 2649, 2661, 2671,
2705, 2706, 2707, 2720, 2748, 2771, 2776, 2835, 2920,
2947, 2955, 2958, 3034, 3214, 3300, 3404, 3468, 3474,
3504, 3576, 3601, 3675, 3678, 3717, 3723, 3724, 3767,
3795, 3852, 3876, 3915, 4126, 4173, 4234, 4298, 4308,
4422, 4456, 4492, 4541, 4634, 4658, 4660, 4891, 4951,
4990, 5077, 5088, 5091, 5169, 5173, 5213, 5234, 5239,
5243, 5281, 5304, 5305, 5325, 5334, 5398, 5442, 5514,
5524, 5542, 5544, 5593, 5634, 5649, 5666, 5745, 5809,
5831, 5863, 5911, 5922, 5924, 5937, 5982, 6082, 6136,
6233, 6239, 6256, 6261, 6283, 6289, 6304, 6319, 6447,
6512, 6515, 6529, 6558, 6611, 6643, 6652, 6663, 6697,
6843, 6852, 6887, 6892, 6930, 7020, 7022, 7101, 7107,
7157, 7217, 7299, 7381, 7551, 7594, 7613, 7614, 7834,
7860, 7914, 7959, 7962, 8007, 8051, 8074, 8154, 8197,
8214, 8258, 8280, 8298, 8383, 8449, 8484, 8512, 8634,
8738, 8781, 8789, 8808, 8881, 8922, 8935, 8963, 9001,
9042, 9053, 9060, 9117, 9120, 9150, 9187, 9192, 9228,
9237, 9249, 9284, 9290, 9350, 9373, 9567, 9585, 9614,
9675, 9783, 9831, 9876, 9919, 10098, 10134, 10183, 10249,
10257, 10320, 10326, 10344, 10386, 10455, 10550, 10627, 10645,
10655, 10698, 10712, 10727, 10733, 10750, 10919, 10940, 10996,
11030, 11036, 11063, 11071, 11098, 11187, 11197, 11223, 11257,
11283, 11367, 11462, 11465, 11578, 11602, 11762, 11783, 11818,
11820, 11831, 11884, 11913, 12073, 12143, 12192, 12253, 12329,
12493, 12532, 12535, 12548, 12569, 12618, 12632, 12663, 12826,
13006, 13009, 13096, 13133, 13212, 13235, 13291, 13297, 13327,
13376, 13550, 13661, 13758, 13829, 13851, 13971, 13972, 14040,
14053, 14068, 14114, 14144, 14206, 14267, 14270, 14271, 14294,
14327, 14407, 14436, 14458, 14470, 14479, 14535, 14561, 14582,
14670, 14714, 14729, 14831, 14832, 14838, 14843, 14996, 15011,
15069, 15110, 15162, 15181, 15191, 15231, 15251, 15271, 15299,
15308, 15345, 15376, 15456, 15507, 15515, 15583, 15608, 15624,
15678, 15732, 15813, 15820, 15825, 15849, 15861, 15882, 15885,
15887, 15895, 15908, 16031, 16077, 16101, 16231, 16307, 16410,
16426, 16522, 16594, 16617, 16631, 16684, 16704, 16763, 16768,
16775, 16811, 16813, 16821, 16824, 16879, 16898, 16980, 17037,
17116, 17158, 17165, 17179, 17217, 17241, 17274, 17321, 17366,
17398, 17407, 17410, 17454, 17470, 17472, 17547, 17741, 17840,
17891, 17958, 18020, 18048, 18186, 18257, 18259, 18276, 18288,
18404, 18575, 18578, 18582, 18614, 18671, 18673, 18773, 18864,
18933, 18936, 18937, 18966, 19005], dtype=int64),)
VehicleSpeedVariation
(array([ 48, 97, 195, 204, 224, 363, 489, 527, 541,
692, 1047, 1111, 1168, 1283, 1514, 1676, 1697, 1707,
1792, 1793, 1953, 1971, 2260, 2378, 2436, 2472, 2552,
2562, 2777, 2884, 3058, 3069, 3094, 3109, 3127, 3157,
3245, 3264, 3434, 3474, 3582, 3636, 3657, 3791, 3811,
3850, 4038, 4089, 4143, 4484, 4504, 4526, 4582, 4620,
4627, 4643, 4872, 5033, 5127, 5197, 5277, 5282, 5295,
5328, 5532, 5542, 5573, 6018, 6163, 6177, 6363, 6404,
6418, 6436, 6471, 6581, 6587, 6662, 6706, 6730, 6904,
6959, 6973, 7040, 7146, 7159, 7441, 7456, 7509, 7537,
7568, 7706, 7714, 7715, 7879, 7900, 7920, 7994, 8013,
8044, 8085, 8094, 8121, 8196, 8243, 8343, 8353, 8358,
8389, 8572, 8787, 8824, 8892, 8922, 8956, 8985, 9102,
9120, 9133, 9195, 9383, 9556, 9569, 9599, 9616, 9650,
9710, 9713, 9918, 9931, 10016, 10022, 10234, 10262, 10268,
10400, 10473, 10517, 10544, 10559, 10742, 10824, 10868, 10990,
11117, 11177, 11293, 11457, 11544, 11623, 11786, 11832, 11884,
12023, 12029, 12176, 12343, 12420, 12622, 12699, 12801, 12870,
12899, 12944, 13048, 13076, 13153, 13154, 13170, 13259, 13461,
13675, 13740, 13769, 13825, 13961, 13999, 14146, 14165, 14440,
14541, 14603, 14645, 14658, 14670, 14810, 15150, 15164, 15182,
15218, 15236, 15326, 15329, 15395, 15450, 15552, 15640, 15730,
15751, 15848, 15877, 15933, 16001, 16067, 16085, 16279, 16368,
16526, 16623, 16841, 16918, 16929, 17009, 17093, 17136, 17203,
17357, 17364, 17374, 17522, 17771, 17803, 17909, 17939, 17974,
18049, 18056, 18087, 18119, 18337, 18442, 18453, 18651, 18674,
18944], dtype=int64),)
LongitudinalAcceleration
(array([ 25, 71, 81, 116, 165, 252, 289, 294, 298,
311, 335, 428, 456, 578, 605, 626, 634, 652,
684, 706, 740, 755, 769, 790, 799, 803, 814,
817, 861, 917, 934, 940, 949, 992, 1023, 1182,
1227, 1246, 1282, 1297, 1317, 1335, 1372, 1436, 1450,
1486, 1490, 1524, 1570, 1589, 1612, 1614, 1621, 1651,
1653, 1663, 1693, 1706, 1710, 1748, 1755, 1780, 1816,
1828, 1838, 1839, 1894, 1908, 1909, 1932, 1942, 2024,
2047, 2084, 2089, 2127, 2148, 2227, 2297, 2353, 2404,
2407, 2415, 2424, 2435, 2451, 2456, 2457, 2466, 2518,
2589, 2600, 2657, 2664, 2700, 2704, 2713, 2747, 2770,
2802, 2851, 2855, 2866, 2889, 2893, 2924, 2931, 3006,
3007, 3018, 3025, 3038, 3045, 3048, 3065, 3080, 3090,
3092, 3138, 3177, 3182, 3206, 3222, 3276, 3285, 3307,
3330, 3352, 3403, 3405, 3407, 3409, 3428, 3450, 3459,
3522, 3583, 3588, 3663, 3667, 3679, 3685, 3699, 3710,
3711, 3722, 3740, 3752, 3792, 3798, 3864, 3866, 3886,
3892, 3896, 3953, 4017, 4021, 4044, 4063, 4080, 4109,
4165, 4184, 4202, 4214, 4283, 4372, 4374, 4416, 4538,
4594, 4612, 4671, 4691, 4699, 4789, 4840, 4872, 4878,
4884, 4961, 4991, 5000, 5032, 5036, 5081, 5082, 5090,
5119, 5124, 5128, 5139, 5143, 5159, 5210, 5215, 5219,
5221, 5238, 5256, 5297, 5300, 5341, 5371, 5443, 5468,
5478, 5499, 5540, 5635, 5681, 5736, 5760, 5762, 5772,
5778, 5786, 5799, 5822, 5826, 5867, 5896, 5963, 5980,
5999, 6019, 6061, 6065, 6093, 6103, 6129, 6152, 6187,
6250, 6263, 6303, 6321, 6351, 6390, 6436, 6465, 6481,
6520, 6552, 6595, 6612, 6632, 6634, 6673, 6683, 6691,
6713, 6720, 6746, 6797, 6801, 6813, 6849, 6858, 6977,
6997, 7013, 7031, 7047, 7050, 7095, 7139, 7148, 7201,
7209, 7232, 7245, 7294, 7343, 7349, 7410, 7515, 7516,
7534, 7588, 7641, 7654, 7697, 7709, 7740, 7744, 7754,
7764, 7771, 7785, 7802, 7838, 7877, 7922, 8045, 8047,
8143, 8181, 8186, 8193, 8194, 8218, 8223, 8259, 8266,
8269, 8339, 8384, 8396, 8434, 8464, 8465, 8496, 8516,
8561, 8572, 8573, 8620, 8625, 8674, 8815, 8828, 8848,
8851, 8863, 8867, 8872, 8887, 8905, 8908, 8978, 8983,
9051, 9067, 9068, 9112, 9183, 9190, 9248, 9302, 9323,
9328, 9348, 9351, 9404, 9412, 9458, 9483, 9498, 9501,
9544, 9559, 9581, 9590, 9665, 9670, 9686, 9694, 9717,
9719, 9745, 9749, 9785, 9787, 9816, 9863, 9882, 9917,
9945, 9962, 9978, 9986, 10062, 10097, 10123, 10187, 10200,
10221, 10274, 10315, 10319, 10321, 10327, 10375, 10405, 10421,
10438, 10440, 10454, 10456, 10465, 10488, 10493, 10522, 10553,
10571, 10593, 10601, 10620, 10629, 10671, 10674, 10678, 10680,
10730, 10741, 10745, 10798, 10836, 10840, 10868, 10882, 10891,
10894, 10921, 10943, 10962, 10963, 11003, 11004, 11035, 11078,
11087, 11160, 11191, 11196, 11215, 11248, 11262, 11270, 11271,
11278, 11286, 11356, 11414, 11420, 11475, 11501, 11514, 11540,
11551, 11559, 11565, 11645, 11646, 11679, 11695, 11703, 11706,
11717, 11724, 11739, 11756, 11854, 11856, 11878, 11925, 11995,
12066, 12085, 12124, 12136, 12181, 12210, 12220, 12248, 12290,
12365, 12386, 12388, 12401, 12503, 12515, 12524, 12538, 12550,
12564, 12584, 12601, 12623, 12636, 12641, 12643, 12655, 12672,
12731, 12748, 12790, 12829, 12873, 12935, 12937, 12962, 12966,
12978, 12979, 12997, 13003, 13013, 13016, 13076, 13103, 13130,
13157, 13188, 13208, 13300, 13312, 13318, 13335, 13358, 13380,
13396, 13448, 13451, 13461, 13472, 13488, 13489, 13499, 13502,
13587, 13599, 13624, 13644, 13684, 13703, 13712, 13724, 13752,
13756, 13760, 13786, 13801, 13902, 13903, 13988, 14076, 14107,
14145, 14183, 14193, 14214, 14224, 14232, 14288, 14298, 14299,
14314, 14318, 14330, 14378, 14394, 14405, 14417, 14424, 14433,
14441, 14484, 14487, 14509, 14566, 14579, 14588, 14605, 14624,
14643, 14665, 14704, 14715, 14735, 14744, 14800, 14801, 14835,
14840, 14880, 14888, 14890, 14931, 14947, 14979, 15084, 15130,
15157, 15164, 15169, 15173, 15183, 15190, 15253, 15261, 15304,
15306, 15332, 15368, 15386, 15408, 15412, 15413, 15471, 15481,
15537, 15569, 15641, 15714, 15740, 15767, 15774, 15898, 15914,
15926, 15939, 15946, 15961, 15971, 15990, 15994, 16019, 16027,
16038, 16043, 16044, 16045, 16056, 16068, 16084, 16095, 16124,
16168, 16176, 16194, 16230, 16242, 16311, 16313, 16326, 16347,
16351, 16388, 16406, 16444, 16446, 16449, 16463, 16474, 16513,
16523, 16560, 16570, 16606, 16654, 16701, 16734, 16776, 16815,
16828, 16914, 16956, 16989, 16995, 16999, 17005, 17082, 17089,
17101, 17123, 17146, 17159, 17204, 17259, 17289, 17304, 17310,
17334, 17359, 17393, 17425, 17465, 17492, 17532, 17559, 17563,
17613, 17629, 17644, 17742, 17763, 17843, 17976, 17987, 17996,
18025, 18059, 18080, 18144, 18167, 18242, 18260, 18275, 18285,
18334, 18389, 18422, 18432, 18465, 18466, 18474, 18517, 18523,
18527, 18540, 18581, 18602, 18608, 18618, 18648, 18688, 18697,
18722, 18738, 18749, 18762, 18822, 18912, 18956, 19011],
dtype=int64),)
EngineLoad
(array([], dtype=int64),)
EngineCoolantTemperature
(array([], dtype=int64),)
ManifoldAbsolutePressure
(array([], dtype=int64),)
EngineRPM
(array([], dtype=int64),)
MassAirFlow
(array([], dtype=int64),)
IntakeAirTemperature
(array([], dtype=int64),)
VerticalAcceleration
(array([ 25, 71, 81, 116, 165, 252, 289, 294, 298,
311, 335, 428, 456, 578, 605, 626, 634, 652,
684, 706, 740, 755, 769, 790, 799, 803, 814,
817, 861, 917, 934, 940, 949, 992, 1023, 1182,
1227, 1246, 1282, 1297, 1317, 1335, 1372, 1436, 1450,
1486, 1490, 1524, 1570, 1589, 1612, 1614, 1621, 1651,
1653, 1663, 1693, 1706, 1710, 1748, 1755, 1780, 1816,
1828, 1838, 1839, 1894, 1908, 1909, 1932, 1942, 2024,
2047, 2084, 2089, 2127, 2148, 2227, 2297, 2353, 2404,
2407, 2415, 2424, 2435, 2451, 2456, 2457, 2466, 2518,
2589, 2600, 2657, 2664, 2700, 2704, 2713, 2747, 2770,
2802, 2851, 2855, 2866, 2889, 2893, 2924, 2931, 3006,
3007, 3018, 3025, 3038, 3045, 3048, 3065, 3080, 3090,
3092, 3138, 3177, 3182, 3206, 3222, 3276, 3285, 3307,
3330, 3352, 3403, 3405, 3407, 3409, 3428, 3450, 3459,
3522, 3583, 3588, 3663, 3667, 3679, 3685, 3699, 3710,
3711, 3722, 3740, 3752, 3792, 3798, 3864, 3866, 3886,
3892, 3896, 3953, 4017, 4021, 4044, 4063, 4080, 4109,
4165, 4184, 4202, 4214, 4283, 4372, 4374, 4416, 4538,
4594, 4612, 4671, 4691, 4699, 4789, 4840, 4872, 4878,
4884, 4961, 4991, 5000, 5032, 5036, 5081, 5082, 5090,
5119, 5124, 5128, 5139, 5143, 5159, 5210, 5215, 5219,
5221, 5238, 5256, 5297, 5300, 5341, 5371, 5443, 5468,
5478, 5499, 5540, 5635, 5681, 5736, 5760, 5762, 5772,
5778, 5786, 5799, 5822, 5826, 5867, 5896, 5963, 5980,
5999, 6019, 6061, 6065, 6093, 6103, 6129, 6152, 6187,
6250, 6263, 6303, 6321, 6351, 6390, 6436, 6465, 6481,
6520, 6552, 6595, 6612, 6632, 6634, 6673, 6683, 6691,
6713, 6720, 6746, 6797, 6801, 6813, 6849, 6858, 6977,
6997, 7013, 7031, 7047, 7050, 7095, 7139, 7148, 7201,
7209, 7232, 7245, 7294, 7343, 7349, 7410, 7515, 7516,
7534, 7588, 7641, 7654, 7697, 7709, 7740, 7744, 7754,
7764, 7771, 7785, 7802, 7838, 7877, 7922, 8045, 8047,
8143, 8181, 8186, 8193, 8194, 8218, 8223, 8259, 8266,
8269, 8339, 8384, 8396, 8434, 8464, 8465, 8496, 8516,
8561, 8572, 8573, 8620, 8625, 8674, 8815, 8828, 8848,
8851, 8863, 8867, 8872, 8887, 8905, 8908, 8978, 8983,
9051, 9067, 9068, 9112, 9183, 9190, 9248, 9302, 9323,
9328, 9348, 9351, 9404, 9412, 9458, 9483, 9498, 9501,
9544, 9559, 9581, 9590, 9665, 9670, 9686, 9694, 9717,
9719, 9745, 9749, 9785, 9787, 9816, 9863, 9882, 9917,
9945, 9962, 9978, 9986, 10062, 10097, 10123, 10187, 10200,
10221, 10274, 10315, 10319, 10321, 10327, 10375, 10405, 10421,
10438, 10440, 10454, 10456, 10465, 10488, 10493, 10522, 10553,
10571, 10593, 10601, 10620, 10629, 10671, 10674, 10678, 10680,
10730, 10741, 10745, 10798, 10836, 10840, 10868, 10882, 10891,
10894, 10921, 10943, 10962, 10963, 11003, 11004, 11035, 11078,
11087, 11160, 11191, 11196, 11215, 11248, 11262, 11270, 11271,
11278, 11286, 11356, 11414, 11420, 11475, 11501, 11514, 11540,
11551, 11559, 11565, 11645, 11646, 11679, 11695, 11703, 11706,
11717, 11724, 11739, 11756, 11854, 11856, 11878, 11925, 11995,
12066, 12085, 12124, 12136, 12181, 12210, 12220, 12248, 12290,
12365, 12386, 12388, 12401, 12503, 12515, 12524, 12538, 12550,
12564, 12584, 12601, 12623, 12636, 12641, 12643, 12655, 12672,
12731, 12748, 12790, 12829, 12873, 12935, 12937, 12962, 12966,
12978, 12979, 12997, 13003, 13013, 13016, 13076, 13103, 13130,
13157, 13188, 13208, 13300, 13312, 13318, 13335, 13358, 13380,
13396, 13448, 13451, 13461, 13472, 13488, 13489, 13499, 13502,
13587, 13599, 13624, 13644, 13684, 13703, 13712, 13724, 13752,
13756, 13760, 13786, 13801, 13902, 13903, 13988, 14076, 14107,
14145, 14183, 14193, 14214, 14224, 14232, 14288, 14298, 14299,
14314, 14318, 14330, 14378, 14394, 14405, 14417, 14424, 14433,
14441, 14484, 14487, 14509, 14566, 14579, 14588, 14605, 14624,
14643, 14665, 14704, 14715, 14735, 14744, 14800, 14801, 14835,
14840, 14880, 14888, 14890, 14931, 14947, 14979, 15084, 15130,
15157, 15164, 15169, 15173, 15183, 15190, 15253, 15261, 15304,
15306, 15332, 15368, 15386, 15408, 15412, 15413, 15471, 15481,
15537, 15569, 15641, 15714, 15740, 15767, 15774, 15898, 15914,
15926, 15939, 15946, 15961, 15971, 15990, 15994, 16019, 16027,
16038, 16043, 16044, 16045, 16056, 16068, 16084, 16095, 16124,
16168, 16176, 16194, 16230, 16242, 16311, 16313, 16326, 16347,
16351, 16388, 16406, 16444, 16446, 16449, 16463, 16474, 16513,
16523, 16560, 16570, 16606, 16654, 16701, 16734, 16776, 16815,
16828, 16914, 16956, 16989, 16995, 16999, 17005, 17082, 17089,
17101, 17123, 17146, 17159, 17204, 17259, 17289, 17304, 17310,
17334, 17359, 17393, 17425, 17465, 17492, 17532, 17559, 17563,
17613, 17629, 17644, 17742, 17763, 17843, 17976, 17987, 17996,
18025, 18059, 18080, 18144, 18167, 18242, 18260, 18275, 18285,
18334, 18389, 18422, 18432, 18465, 18466, 18474, 18517, 18523,
18527, 18540, 18581, 18602, 18608, 18618, 18648, 18688, 18697,
18722, 18738, 18749, 18762, 18822, 18912, 18956, 19011],
dtype=int64),)
FuelConsumptionAverage
(array([], dtype=int64),)
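A more compact way to summarise the check above is to count the outliers per column rather than print every position. The sketch below uses a made-up frame (the columns `a` and `b` are hypothetical) and also shows how a single missing value changes the result: by default `zscore` propagates NaN to the whole column, while `nan_policy='omit'` ignores the missing entries.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(0)
# Hypothetical data: both columns hold one obvious outlier, 'b' also has a NaN.
toy = pd.DataFrame({
    "a": np.append(rng.normal(size=200), 25.0),
    "b": np.append(rng.normal(size=199), [25.0, np.nan]),
})

# Default behaviour: one NaN makes every z-score in the column NaN,
# so the comparison z > 3 is False everywhere.
default_counts = {
    c: int((np.abs(st.zscore(toy[c].to_numpy())) > 3).sum()) for c in toy
}

# nan_policy='omit' leaves the NaN out of the mean and standard deviation.
omit_counts = {
    c: int((np.abs(st.zscore(toy[c].to_numpy(), nan_policy="omit")) > 3).sum())
    for c in toy
}

print("default:", default_counts)
print("omit:   ", omit_counts)
```

This matters for our columns with missing values, since an empty result there may only reflect the NaN handling.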
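We can see that AltitudeVariation, VehicleSpeedVariance, VehicleSpeedVariation, LongitudinalAcceleration and VerticalAcceleration all have a large number of outliers. The empty results for the columns that contain missing values (those with a count below 19020 in the describe() output) are not conclusive, because zscore returns NaN for every row of a column that contains a NaN. Now let's visualize the distributions of each feature.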
X_train_num.hist(bins=50, figsize=(20,15))
[Figure: 4x4 grid of histograms, one panel per numeric feature in X_train_num (14 panels, the last two axes are empty)]
orig_dist = pd.DataFrame(X_train_num.agg(['skew', 'kurtosis']).transpose())
orig_dist = orig_dist.add_prefix('original_')
orig_dist #the skewness/kurtosis before any transformation
| | original_skew | original_kurtosis |
|---|---|---|
| AltitudeVariation | -0.402307 | 8.401583 |
| VehicleSpeedInstantaneous | 0.837620 | -0.247995 |
| VehicleSpeedAverage | 1.099062 | 0.181004 |
| VehicleSpeedVariance | 2.135534 | 6.000058 |
| VehicleSpeedVariation | -0.990545 | 267.865298 |
| LongitudinalAcceleration | 2.597981 | 8.785024 |
| EngineLoad | 0.383122 | -0.856801 |
| EngineCoolantTemperature | -1.445955 | 1.001245 |
| ManifoldAbsolutePressure | 2.298943 | 6.427249 |
| EngineRPM | 0.209905 | -1.122746 |
| MassAirFlow | 0.877312 | 0.901807 |
| IntakeAirTemperature | 1.071027 | 0.359724 |
| VerticalAcceleration | 4.573195 | 19.753418 |
| FuelConsumptionAverage | 0.671023 | 0.426473 |
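As a reading aid for the table above: pandas reports excess kurtosis, so a normal distribution scores roughly 0 on both measures, and a long right tail gives a positive skew. A minimal sketch on synthetic data (the two columns are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "normal": rng.normal(size=50_000),            # symmetric reference
    "exponential": rng.exponential(size=50_000),  # long right tail
})

# Same aggregation as used for orig_dist, applied to the toy columns.
summary = toy.agg(["skew", "kurtosis"]).transpose()
print(summary.round(2))
```

Against that baseline, values such as the kurtosis of VehicleSpeedVariation stand out as extremely heavy-tailed.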
#taking out these columns because of negative values
X_train_to_transform = X_train_num.drop(['VerticalAcceleration', 'AltitudeVariation','LongitudinalAcceleration',
'VehicleSpeedVariation'], axis=1)
df_sqrt = X_train_to_transform.applymap(lambda x: np.sqrt(x)) #getting the square root transformation
df_sqrt.hist(bins=50, figsize=(20,15))
[Figure: grid of histograms of the 10 square-root-transformed features in df_sqrt]
df_log = X_train_to_transform.applymap(lambda x: np.log(x+1)) # log transformation (+1 for zero values)
df_log.hist(bins=50, figsize=(20,15))
[Figure: grid of histograms of the 10 log-transformed features in df_log]
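# EngineCoolantTemperature is negatively skewed, so reflect it around (max + 1) before taking the log
# (this overwrites the column in X_train_num, so later cells see the transformed values)
ect = X_train_num['EngineCoolantTemperature']
X_train_num['EngineCoolantTemperature'] = np.log(ect.max() + 1 - ect)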
neg_log = pd.DataFrame(X_train_num['EngineCoolantTemperature'].agg(['skew', 'kurtosis']).transpose())
neg_log
| | EngineCoolantTemperature |
|---|---|
| skew | 0.672345 |
| kurtosis | -0.602490 |
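The reflect-then-log trick used for EngineCoolantTemperature can be sketched on synthetic data as follows. The `temps` series is hypothetical; it just mimics a column where most readings sit near an upper limit with a long tail towards low values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical left-skewed sample: most values close to 90, a tail of low values.
temps = pd.Series(90.0 - rng.exponential(scale=8.0, size=10_000))

# Reflect around (max + 1): the tail now points right and every value is >= 1,
# so the log is defined and compresses that tail.
reflected_log = np.log(temps.max() + 1 - temps)

print("skew before:", round(temps.skew(), 2))
print("skew after: ", round(reflected_log.skew(), 2))
```

Note that the reflection reverses the ordering, so large transformed values correspond to low original temperatures.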
X_train_num['EngineCoolantTemperature'].hist(bins=50, figsize=(8,6))
[Figure: histogram of the reflected-log EngineCoolantTemperature values]
from sklearn.preprocessing import PowerTransformer
df_power = X_train_num.copy()
pt = PowerTransformer()
df_power = pt.fit_transform(df_power)
df_power = pd.DataFrame(df_power, columns=X_train_num.columns)
df_power.hist(bins=50, figsize=(20,15))
[Figure: 4x4 grid of histograms, one panel per power-transformed feature in df_power (14 panels, the last two axes are empty)]
sqrt_dist = pd.DataFrame(df_sqrt.agg(['skew', 'kurtosis']).transpose())
sqrt_dist = sqrt_dist.add_prefix('sqrt_')
log_dist = pd.DataFrame(df_log.agg(['skew', 'kurtosis']).transpose())
log_dist = log_dist.add_prefix('log_')
power_dist = pd.DataFrame(df_power.agg(['skew', 'kurtosis']).transpose())
power_dist = power_dist.add_prefix('power_')
compare_tab = pd.concat([orig_dist, sqrt_dist, log_dist, power_dist], axis=1)
compare_tab
| | original_skew | original_kurtosis | sqrt_skew | sqrt_kurtosis | log_skew | log_kurtosis | power_skew | power_kurtosis |
|---|---|---|---|---|---|---|---|---|
| AltitudeVariation | -0.402307 | 8.401583 | NaN | NaN | NaN | NaN | -0.289393 | 7.812031 |
| VehicleSpeedInstantaneous | 0.837620 | -0.247995 | -0.206263 | -0.773965 | -1.033898 | -0.148287 | -0.224737 | -0.842931 |
| VehicleSpeedAverage | 1.099062 | 0.181004 | 0.341360 | -0.419481 | -0.824222 | 1.159783 | -0.023818 | -0.336328 |
| VehicleSpeedVariance | 2.135534 | 6.000058 | 0.730340 | 0.488030 | -0.945327 | 1.278815 | -0.025494 | -0.193920 |
| VehicleSpeedVariation | -0.990545 | 267.865298 | NaN | NaN | NaN | NaN | 1.143763 | 277.054614 |
| LongitudinalAcceleration | 2.597981 | 8.785024 | NaN | NaN | NaN | NaN | -0.100434 | 2.208256 |
| EngineLoad | 0.383122 | -0.856801 | -0.620745 | -0.464599 | -1.343448 | 0.519683 | -0.315556 | -0.769679 |
| EngineCoolantTemperature | -1.445955 | 1.001245 | 0.358388 | 0.335942 | 0.298515 | 0.074850 | -0.035853 | 2.036621 |
| ManifoldAbsolutePressure | 2.298943 | 6.427249 | 2.003956 | 4.456045 | 1.763050 | 3.048469 | 0.469551 | -1.058764 |
| EngineRPM | 0.209905 | -1.122746 | -0.124027 | -0.664212 | -4.172463 | 59.059662 | -0.050869 | -0.923728 |
| MassAirFlow | 0.877312 | 0.901807 | 0.213216 | -0.632682 | -0.245412 | -0.935487 | -0.041334 | -0.898636 |
| IntakeAirTemperature | 1.071027 | 0.359724 | 0.719197 | -0.349151 | 0.380846 | -0.669552 | 0.044870 | -0.671133 |
| VerticalAcceleration | 4.573195 | 19.753418 | NaN | NaN | NaN | NaN | -0.651690 | 4.254750 |
| FuelConsumptionAverage | 0.671023 | 0.426473 | 0.323957 | -0.401449 | 0.055133 | -0.683098 | 0.005875 | -0.704331 |
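PowerTransformer seems to be the most effective transform overall: it brings skewness closer to zero than the square root or log for almost every column. The exceptions are VehicleSpeedVariation, whose skew goes from -0.99 to 1.14, and VehicleSpeedInstantaneous, where the square root is marginally better (-0.21 against -0.22). It does, however, increase the magnitude of the kurtosis for VehicleSpeedInstantaneous, VehicleSpeedAverage, VehicleSpeedVariation, EngineCoolantTemperature, IntakeAirTemperature and FuelConsumptionAverage.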
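For anyone unfamiliar with it, here is a minimal sketch of what PowerTransformer does, on synthetic data (both toy columns are made up). With its defaults it fits a Yeo-Johnson transform per column, which is why it copes with the negative values that ruled out the square root and log for some of our features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "right_skewed": rng.exponential(scale=2.0, size=5_000),
    "with_negatives": rng.exponential(scale=2.0, size=5_000) - 1.0,
})

pt = PowerTransformer()  # defaults: method='yeo-johnson', standardize=True
transformed = pd.DataFrame(pt.fit_transform(toy), columns=toy.columns)

# One lambda is fitted per column; skewness should shrink towards zero.
print("fitted lambdas:", pt.lambdas_)
print(pd.concat([toy.skew().rename("before"),
                 transformed.skew().rename("after")], axis=1))
```

Because `standardize=True` is the default, the output already has zero mean and unit variance per column.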
for column in df_power:
    z = np.abs(st.zscore(df_power[column]))  # getting the z-scores for this column
    print(column)
    print(np.where(z > 3))  # checking if there are any outliers with a z-score > 3
AltitudeVariation
(array([ 88, 165, 247, 281, 314, 355, 531, 599, 931,
981, 1394, 1400, 1478, 1807, 1869, 1894, 1900, 1980,
1996, 2100, 2198, 2346, 2363, 2380, 2489, 2576, 2668,
2681, 2689, 2732, 2788, 2815, 2839, 2859, 2989, 3065,
3214, 3279, 3407, 3409, 3513, 3539, 3567, 3886, 3908,
3924, 4067, 4090, 4161, 4166, 4180, 4202, 4345, 4411,
4489, 4606, 4651, 4700, 4864, 4887, 4920, 5042, 5124,
5173, 5299, 5308, 5434, 5445, 5474, 5635, 5675, 5774,
5806, 5864, 5871, 5937, 6014, 6021, 6049, 6063, 6112,
6235, 6300, 6412, 6415, 6426, 6478, 6481, 6552, 6710,
6748, 6897, 7025, 7094, 7147, 7184, 7202, 7263, 7338,
7362, 7549, 7828, 7943, 7948, 7961, 8002, 8068, 8084,
8197, 8211, 8330, 8362, 8391, 8503, 8936, 8961, 8976,
9053, 9070, 9092, 9132, 9176, 9198, 9257, 9323, 9328,
9597, 9663, 9671, 9682, 9685, 9715, 9809, 9859, 9920,
9991, 10015, 10078, 10107, 10147, 10156, 10161, 10239, 10259,
10398, 10489, 10512, 10523, 10578, 10618, 10693, 10788, 11084,
11089, 11095, 11103, 11105, 11147, 11280, 11312, 11324, 11403,
11459, 11667, 11727, 11728, 11785, 11787, 11792, 11893, 12056,
12291, 12355, 12452, 12662, 12677, 12766, 12815, 12875, 12947,
13013, 13027, 13082, 13184, 13229, 13252, 13306, 13321, 13437,
13483, 13505, 13619, 13753, 13756, 13881, 13900, 13941, 13977,
14205, 14232, 14267, 14305, 14306, 14352, 14513, 14519, 14542,
14596, 14776, 14835, 14865, 14873, 14880, 14901, 14902, 14923,
14954, 15056, 15138, 15238, 15291, 15368, 15410, 15433, 15591,
15607, 15932, 15991, 16141, 16173, 16190, 16194, 16225, 16270,
16281, 16308, 16397, 16417, 16418, 16466, 16498, 16609, 16610,
16643, 16678, 16698, 16745, 16900, 17044, 17070, 17176, 17220,
17326, 17379, 17406, 17435, 17672, 17681, 17721, 17749, 17765,
17888, 17971, 18108, 18161, 18167, 18171, 18188, 18191, 18288,
18338, 18355, 18455, 18622, 18721, 18732, 18810, 18812, 18822,
18849, 18946, 18970, 18983], dtype=int64),)
VehicleSpeedInstantaneous
(array([], dtype=int64),)
VehicleSpeedAverage
(array([], dtype=int64),)
VehicleSpeedVariance
(array([ 505, 1469, 16768], dtype=int64),)
VehicleSpeedVariation
(array([ 48, 97, 195, 204, 224, 489, 527, 541, 692,
1047, 1283, 1514, 1676, 1697, 1707, 1792, 1793, 1953,
1971, 2260, 2378, 2436, 2446, 2472, 2777, 2884, 3058,
3069, 3094, 3109, 3127, 3157, 3245, 3264, 3434, 3474,
3582, 3636, 3657, 3791, 3811, 4038, 4089, 4143, 4275,
4484, 4504, 4526, 4582, 4620, 4627, 4643, 5033, 5127,
5197, 5277, 5282, 5295, 5328, 5532, 5542, 5573, 6018,
6163, 6177, 6363, 6404, 6418, 6471, 6581, 6587, 6662,
6706, 6730, 6878, 6904, 6959, 6973, 7040, 7146, 7159,
7441, 7456, 7509, 7537, 7568, 7611, 7706, 7714, 7715,
7879, 7900, 7994, 8013, 8044, 8085, 8094, 8121, 8196,
8243, 8343, 8353, 8358, 8389, 8572, 8631, 8787, 8824,
8892, 8922, 8956, 8985, 9102, 9120, 9133, 9195, 9383,
9556, 9569, 9599, 9616, 9650, 9710, 9713, 9918, 9931,
10016, 10022, 10128, 10234, 10262, 10268, 10400, 10473, 10517,
10544, 10559, 10742, 10824, 10868, 10990, 11117, 11177, 11293,
11544, 11623, 11786, 11832, 11884, 11930, 12023, 12029, 12176,
12343, 12420, 12622, 12699, 12801, 12870, 12899, 12944, 13048,
13076, 13153, 13154, 13170, 13259, 13461, 13675, 13740, 13769,
13825, 13961, 13999, 14146, 14440, 14541, 14603, 14645, 14658,
14670, 14810, 15150, 15164, 15182, 15218, 15236, 15326, 15329,
15395, 15450, 15552, 15640, 15751, 15848, 15877, 15933, 16001,
16067, 16085, 16279, 16368, 16526, 16623, 16841, 16918, 16929,
17009, 17093, 17136, 17203, 17357, 17364, 17374, 17522, 17771,
17803, 17909, 17939, 18049, 18056, 18087, 18119, 18337, 18442,
18453, 18651, 18674, 18944], dtype=int64),)
LongitudinalAcceleration
(array([ 3, 233, 471, 714, 984, 1036, 1044, 1070, 1130,
1258, 1265, 1306, 1324, 1493, 1525, 1606, 1857, 1948,
2161, 2202, 2286, 2656, 2737, 2772, 3501, 3829, 3960,
3989, 4003, 4033, 4105, 4304, 4548, 4626, 4726, 5198,
5201, 5276, 5445, 5591, 5771, 5930, 6040, 6209, 6381,
6436, 6583, 6705, 6831, 6981, 7685, 7772, 7919, 7962,
8033, 8163, 8289, 8551, 8782, 8838, 8887, 8948, 9184,
9365, 9403, 9879, 9985, 10174, 10433, 10524, 10758, 11002,
11034, 11578, 11600, 11651, 11798, 12001, 12189, 12388, 12641,
12907, 12947, 13003, 13019, 13046, 13099, 13291, 13737, 13857,
13968, 14728, 14834, 14971, 15110, 15235, 15372, 15462, 15582,
15782, 15902, 15919, 15968, 16149, 16275, 16566, 16745, 16781,
16840, 17366, 17486, 18300, 18394, 18461, 18776, 18797],
dtype=int64),)
EngineLoad
(array([], dtype=int64),)
EngineCoolantTemperature
(array([], dtype=int64),)
ManifoldAbsolutePressure
(array([], dtype=int64),)
EngineRPM
(array([], dtype=int64),)
MassAirFlow
(array([], dtype=int64),)
IntakeAirTemperature
(array([], dtype=int64),)
VerticalAcceleration
(array([ 38, 122, 129, 435, 643, 654, 727, 775, 780,
892, 956, 977, 979, 999, 1178, 1299, 1307, 1347,
1396, 1517, 1528, 1551, 1556, 1689, 1705, 1767, 1792,
1953, 1985, 2107, 2167, 2260, 2267, 2378, 2382, 2474,
2506, 2546, 2598, 2841, 3260, 3356, 3507, 3528, 3579,
3605, 3797, 3812, 3849, 4005, 4037, 4067, 4070, 4198,
4335, 4351, 4462, 4487, 4491, 4541, 4592, 4603, 4792,
4806, 4859, 4862, 4868, 4900, 4990, 5015, 5071, 5114,
5122, 5174, 5197, 5450, 5519, 5542, 5568, 5571, 5617,
5744, 5823, 5825, 5924, 6008, 6074, 6082, 6091, 6098,
6123, 6241, 6283, 6347, 6367, 6396, 6431, 6489, 6512,
6605, 6657, 6697, 6725, 6766, 6852, 6892, 6951, 7026,
7104, 7141, 7157, 7319, 7594, 7659, 7681, 7713, 7898,
7908, 7959, 7965, 7973, 8055, 8085, 8242, 8270, 8298,
8491, 8506, 8606, 8624, 8634, 8771, 8792, 8922, 9081,
9173, 9294, 9311, 9395, 9414, 9565, 9640, 9966, 10004,
10032, 10234, 10268, 10386, 10544, 10621, 10627, 10750, 10753,
10893, 10931, 10938, 10977, 11009, 11143, 11155, 11222, 11380,
11462, 11467, 11527, 11608, 11609, 11650, 12030, 12199, 12231,
12233, 12532, 12598, 12696, 12704, 12841, 12870, 12888, 12911,
12967, 13202, 13402, 13407, 13512, 13666, 13762, 13781, 13836,
13859, 13993, 14053, 14218, 14270, 14293, 14294, 14518, 14568,
14582, 14628, 14701, 14810, 14832, 14988, 15005, 15095, 15122,
15125, 15252, 15405, 15515, 15578, 15715, 15820, 15895, 16048,
16365, 16631, 16788, 16841, 16908, 17045, 17109, 17228, 17295,
17300, 17314, 17376, 17404, 17488, 17741, 17820, 17852, 17855,
18016, 18190, 18280, 18522, 18542, 18639, 18710, 18719, 18733,
18773, 18888, 18944, 18953, 18975, 19017], dtype=int64),)
FuelConsumptionAverage
(array([], dtype=int64),)
PowerTransformer has removed most of the outliers in the VehicleSpeedVariance and LongitudinalAcceleration columns, but a good many outliers remain in the AltitudeVariation, VehicleSpeedVariation and VerticalAcceleration columns. We should keep this in mind when selecting which features to train the models on.
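To make the before/after comparison concrete, here is a minimal sketch of counting outliers on either side of the transform. It assumes the 1.5 x IQR rule (the rule actually used earlier in the notebook may differ) and uses a synthetic right-skewed column rather than our data; `iqr_outlier_idx` is a hypothetical helper.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def iqr_outlier_idx(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.where((values < q1 - k * iqr) | (values > q3 + k * iqr))[0]

# Synthetic right-skewed column standing in for a sensor reading (assumption).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Yeo-Johnson is PowerTransformer's default method.
transformed = PowerTransformer().fit_transform(skewed.reshape(-1, 1)).ravel()

n_before = len(iqr_outlier_idx(skewed))
n_after = len(iqr_outlier_idx(transformed))
print(f"outliers before: {n_before}, after: {n_after}")
```

Running the same helper per column on the real training data would give a count we can compare across features instead of reading the index arrays by eye.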
X_train.select_dtypes(include=object).columns.tolist()
['traffic', 'drivingStyle']
X_train.select_dtypes(include=np.number).columns.tolist()
['AltitudeVariation', 'VehicleSpeedInstantaneous', 'VehicleSpeedAverage', 'VehicleSpeedVariance', 'VehicleSpeedVariation', 'LongitudinalAcceleration', 'EngineLoad', 'EngineCoolantTemperature', 'ManifoldAbsolutePressure', 'EngineRPM', 'MassAirFlow', 'IntakeAirTemperature', 'VerticalAcceleration', 'FuelConsumptionAverage']
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler
from sklearn.impute import SimpleImputer
cat_cols = X_train.select_dtypes(include=object).columns.tolist()
# Build numeric processor
num_cols = X_train.select_dtypes(include=np.number).columns.tolist()
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # filling in missing values
    ('power_transform', PowerTransformer()),  # transforming the data
    ('std_scaler', StandardScaler())  # standardizing the data
])
prep_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols)  # one-hot-encoding the categorical columns
])
X_train.shape
(19020, 16)
X_test.shape
(4755, 16)
X_train_prepared = prep_pipeline.fit_transform(X_train)
print(X_train_prepared.shape)
(19020, 19)
X_test_prepared = prep_pipeline.transform(X_test)
print(X_test_prepared.shape)
(4755, 19)
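One possible refinement, sketched below on a toy frame: wrapping the preprocessing and the classifier in a single Pipeline means cross-validation refits the imputer and transformer inside each fold, rather than once on the whole training set. The column names and data here are hypothetical. Note that PowerTransformer standardizes its output by default (standardize=True), so the sketch leaves out the separate StandardScaler step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

rng = np.random.default_rng(0)
n = 300
# Toy frame: two numeric columns and one categorical column (all hypothetical).
X_toy = pd.DataFrame({
    "speed": rng.gamma(2.0, 10.0, n),
    "rpm": rng.normal(2000, 400, n),
    "traffic": rng.choice(["Low", "Medium", "High"], n),
})
X_toy.loc[::25, "speed"] = np.nan          # a few missing values to impute
y_toy = (X_toy["rpm"] > 2000).astype(int)  # toy target

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("power_transform", PowerTransformer()),  # also standardizes by default
])
prep = ColumnTransformer([
    ("num", num_pipe, ["speed", "rpm"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["traffic"]),
])
clf = Pipeline([
    ("prep", prep),
    ("model", RandomForestClassifier(n_estimators=50, random_state=1)),
])

# Preprocessing is refitted on the training part of every fold.
scores = cross_val_score(clf, X_toy, y_toy, cv=5)
print(scores.mean())
```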
Let's use a random forest classifier to find the importance of each feature and from there we can decide which features to train our models with.
from sklearn.ensemble import RandomForestClassifier
# define the model
model = RandomForestClassifier(random_state=1)
# fit the model
model.fit(X_train_prepared, y_train)
# get importance
importance = model.feature_importances_
plt.figure(figsize=(20,20))
feats = {} # a dict to hold feature_name: feature_importance
# pair each column name with its importance (a separate loop variable avoids overwriting `importance`)
for feature, score in zip(pd.get_dummies(X_train).columns, importance):
    feats[feature] = score # add the name/value pair
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='barh', figsize=(6,6), fontsize=12, title='Feature Importance')
<AxesSubplot:title={'center':'Feature Importance'}>
<Figure size 1440x1440 with 0 Axes>
importances.sort_values(by='Gini-importance', ascending=False)
| | Gini-importance |
|---|---|
| IntakeAirTemperature | 0.158953 |
| EngineCoolantTemperature | 0.156726 |
| FuelConsumptionAverage | 0.092429 |
| VehicleSpeedAverage | 0.087990 |
| traffic_Low | 0.069648 |
| ManifoldAbsolutePressure | 0.064457 |
| EngineRPM | 0.063354 |
| VerticalAcceleration | 0.049848 |
| VehicleSpeedInstantaneous | 0.044700 |
| traffic_Medium | 0.043991 |
| VehicleSpeedVariance | 0.033708 |
| LongitudinalAcceleration | 0.031665 |
| MassAirFlow | 0.030194 |
| EngineLoad | 0.024039 |
| traffic_High | 0.018967 |
| AltitudeVariation | 0.013449 |
| VehicleSpeedVariation | 0.007800 |
| drivingStyle_AggressiveStyle | 0.004909 |
| drivingStyle_EvenPaceStyle | 0.003173 |
Let's remove the five least important features (drivingStyle, VehicleSpeedVariation, AltitudeVariation, EngineLoad and MassAirFlow). Irrelevant features can lead to overfitting, whereas keeping only the most relevant ones can improve the accuracy of our models and reduce computation time.
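Because the one-hot columns split a categorical feature's importance across several rows, it can help to add them back together before ranking. The sketch below does this with the values from the table above; it assumes that everything before the first underscore in a column name is the original feature name.

```python
import pandas as pd

# Gini importances copied from the table above.
importances = pd.Series({
    "IntakeAirTemperature": 0.158953, "EngineCoolantTemperature": 0.156726,
    "FuelConsumptionAverage": 0.092429, "VehicleSpeedAverage": 0.087990,
    "traffic_Low": 0.069648, "ManifoldAbsolutePressure": 0.064457,
    "EngineRPM": 0.063354, "VerticalAcceleration": 0.049848,
    "VehicleSpeedInstantaneous": 0.044700, "traffic_Medium": 0.043991,
    "VehicleSpeedVariance": 0.033708, "LongitudinalAcceleration": 0.031665,
    "MassAirFlow": 0.030194, "EngineLoad": 0.024039,
    "traffic_High": 0.018967, "AltitudeVariation": 0.013449,
    "VehicleSpeedVariation": 0.007800, "drivingStyle_AggressiveStyle": 0.004909,
    "drivingStyle_EvenPaceStyle": 0.003173,
})

# Group one-hot columns under their source feature and sum their importances.
per_feature = importances.groupby(lambda name: name.split("_")[0]).sum().sort_values()
least_five = per_feature.head(5).index.tolist()
print(per_feature.head(5))
```

Summed this way, traffic as a whole ranks among the more important features, while drivingStyle stays near the bottom.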
X_train.columns
Index(['AltitudeVariation', 'VehicleSpeedInstantaneous', 'VehicleSpeedAverage',
'VehicleSpeedVariance', 'VehicleSpeedVariation',
'LongitudinalAcceleration', 'EngineLoad', 'EngineCoolantTemperature',
'ManifoldAbsolutePressure', 'EngineRPM', 'MassAirFlow',
'IntakeAirTemperature', 'VerticalAcceleration',
'FuelConsumptionAverage', 'traffic', 'drivingStyle'],
dtype='object')
#using only the most relevant features
cat_col = ['traffic']
reduced_num_cols = ['VehicleSpeedInstantaneous', 'VehicleSpeedAverage',
                    'VehicleSpeedVariance', 'LongitudinalAcceleration', 'EngineCoolantTemperature',
                    'ManifoldAbsolutePressure', 'EngineRPM', 'IntakeAirTemperature',
                    'VerticalAcceleration', 'FuelConsumptionAverage']
prep_pipeline = ColumnTransformer([
    ("num", num_pipeline, reduced_num_cols),
    ("cat", OneHotEncoder(), cat_col)
])
X_train_prepared = prep_pipeline.fit_transform(X_train)
print(X_train_prepared.shape)
X_test_prepared = prep_pipeline.transform(X_test)
print(X_test_prepared.shape)
(19020, 13)
(4755, 13)
We will come back to random forest classifiers soon, but first let's try out a Support Vector Machine (SVM) model using scikit-learn's SVC (Support Vector Classification) class. We will first create a grid of hyperparameters for four different SVC kernels and then use GridSearchCV to find the best-performing kernel and hyperparameters.
# imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV
# Create the parameter grid
params_grid = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
               {'kernel': ['poly'], 'C': [1, 10, 100, 1000]},
               {'kernel': ['sigmoid'], 'C': [1, 10, 100, 1000]},
               {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
# Performing CV to tune parameters for best SVM fit
svm_model = GridSearchCV(SVC(), params_grid, cv=5)
svm_model.fit(X_train_prepared, y_train)
GridSearchCV(cv=5, estimator=SVC(),
param_grid=[{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
'kernel': ['rbf']},
{'C': [1, 10, 100, 1000], 'kernel': ['poly']},
{'C': [1, 10, 100, 1000], 'kernel': ['sigmoid']},
{'C': [1, 10, 100, 1000], 'kernel': ['linear']}])
# View the mean cross-validated accuracy of the best candidate
print('Best score for training data:', svm_model.best_score_,"\n")
# View the best parameters for the model found using grid search
print('Best C:',svm_model.best_estimator_.C,"\n")
print('Best Kernel:',svm_model.best_estimator_.kernel,"\n")
print('Best Gamma:',svm_model.best_estimator_.gamma,"\n")
Best score for training data: 0.9862776025236594
Best C: 1000
Best Kernel: poly
Best Gamma: scale
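Note that best_score_ is the mean cross-validated accuracy of the best candidate, not a score on the full training set. To see how the other kernels compared, the fitted search also exposes cv_results_. Here is a minimal sketch on a small synthetic three-class problem (not our data, and with a cut-down grid so it runs quickly):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic 3-class problem (assumption; not the road-surface data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
grid = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10]},
    {"kernel": ["poly"], "C": [1, 10]},
    {"kernel": ["linear"], "C": [1, 10]},
]
search = GridSearchCV(SVC(), grid, cv=5).fit(X_demo, y_demo)

# One row per candidate, with the mean and spread of its fold scores.
results = pd.DataFrame(search.cv_results_)
cols = ["param_kernel", "param_C", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
print("mean CV accuracy of best candidate:", search.best_score_)
```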
best_svc_model = svm_model.best_estimator_
train_pred = best_svc_model.predict(X_train_prepared)
best_svc_model.score(X_train_prepared, y_train)
0.9966351209253418
# Making the confusion matrix for training set
train_cm = confusion_matrix(y_train, train_pred)
train_cm_df = pd.DataFrame(train_cm, index=best_svc_model.classes_, columns=best_svc_model.classes_)
print(train_cm_df)
# printing classification report for training set
print(classification_report(y_train, train_pred))
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 2586 11 2
SmoothCondition 18 11359 13
UnevenCondition 7 13 5011
precision recall f1-score support
FullOfHolesCondition 0.99 0.99 0.99 2599
SmoothCondition 1.00 1.00 1.00 11390
UnevenCondition 1.00 1.00 1.00 5031
accuracy 1.00 19020
macro avg 1.00 1.00 1.00 19020
weighted avg 1.00 1.00 1.00 19020
# final testing
test_pred = best_svc_model.predict(X_test_prepared)
# Making the confusion matrix and classification report for testing set
test_cm = confusion_matrix(y_test, test_pred)
test_cm_df = pd.DataFrame(test_cm, index=best_svc_model.classes_, columns=best_svc_model.classes_)
print(test_cm_df)
print("\n")
print(classification_report(y_test, test_pred))
print("Training set score for SVM (best estimators): %f" % best_svc_model.score(X_train_prepared , y_train))
print("Testing set score for SVM (best estimators): %f" % best_svc_model.score(X_test_prepared , y_test ))
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 638 6 6
SmoothCondition 20 2817 10
UnevenCondition 7 12 1239
precision recall f1-score support
FullOfHolesCondition 0.96 0.98 0.97 650
SmoothCondition 0.99 0.99 0.99 2847
UnevenCondition 0.99 0.98 0.99 1258
accuracy 0.99 4755
macro avg 0.98 0.99 0.98 4755
weighted avg 0.99 0.99 0.99 4755
Training set score for SVM (best estimators): 0.996635
Testing set score for SVM (best estimators): 0.987171
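As a sanity check, the precision, recall and accuracy figures in the report can be reproduced directly from the test confusion matrix. A short worked sketch using the numbers printed above:

```python
import numpy as np

labels = ["FullOfHolesCondition", "SmoothCondition", "UnevenCondition"]
# Test-set confusion matrix from the output above (rows: actual, columns: predicted).
cm = np.array([[638, 6, 6],
               [20, 2817, 10],
               [7, 12, 1239]])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)     # correct predictions / actual count per class
precision = true_pos / cm.sum(axis=0)  # correct predictions / predicted count per class
accuracy = true_pos.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
print(f"accuracy={accuracy:.6f}")
```

The lower precision for FullOfHolesCondition comes from the 20 SmoothCondition and 7 UnevenCondition rows predicted as FullOfHolesCondition.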
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(12, 8))
cmp = ConfusionMatrixDisplay(test_cm, display_labels=best_svc_model.classes_)
cmp.plot(ax=ax)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x268fd145250>
# Plot the confusion matrix (another style)
cmp = ConfusionMatrixDisplay(test_cm, display_labels=best_svc_model.classes_)
cmp = cmp.plot(include_values=True, cmap="YlGnBu", ax=None, xticks_rotation=40)
<Figure size 576x432 with 0 Axes>
#Plotting the confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(test_cm_df, annot=True, fmt="d", annot_kws={"size":15})
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
Text(0.5, 34.0, 'Predicted Values')
The lines below are my attempt to get the ROC AUC scores for this model. I've manually created the model again using the best estimators, but I think I have to set probability to True, and I'm not sure whether this changes the model too or just adds another output. I still haven't figured this out; would anybody else like to try?
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
best_svc_model_manual = SVC(kernel='poly', C=1000, probability=True)
best_svc_model_manual.fit(X_train_prepared, y_train)
SVC(C=1000, kernel='poly', probability=True)
y_score = best_svc_model_manual.fit(X_train_prepared, y_train).decision_function(X_test_prepared)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(len(best_svc_model.classes_)):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-207-fc4a1e55d5c0> in <module>
      6 roc_auc = dict()
      7 for i in range(len(best_svc_model.classes_)):
----> 8     fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
      9     roc_auc[i] = auc(fpr[i], tpr[i])
     10
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    964             return self._get_values(key)
    965
--> 966         return self._get_with(key)
    967
    968     def _get_with(self, key):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in _get_with(self, key)
    979             )
    980         elif isinstance(key, tuple):
--> 981             return self._get_values_tuple(key)
    982
    983         elif not is_list_like(key):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in _get_values_tuple(self, key)
   1014
   1015         if not isinstance(self.index, MultiIndex):
-> 1016             raise KeyError("key of type tuple not found and not a MultiIndex")
   1017
   1018         # If key is contained, would have returned by now
KeyError: 'key of type tuple not found and not a MultiIndex'
n_classes = len(best_svc_model.classes_)
from sklearn.metrics import roc_curve, auc, roc_auc_score
y_prob = best_svc_model_manual.predict_proba(X_test_prepared)
macro_roc_auc_ovo = roc_auc_score(y_test, y_prob, multi_class="ovo", average="macro")
weighted_roc_auc_ovo = roc_auc_score(
y_test, y_prob, multi_class="ovo", average="weighted"
)
macro_roc_auc_ovr = roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro")
weighted_roc_auc_ovr = roc_auc_score(
y_test, y_prob, multi_class="ovr", average="weighted"
)
print(
"One-vs-One ROC AUC scores:\n{:.6f} (macro),\n{:.6f} "
"(weighted by prevalence)".format(macro_roc_auc_ovo, weighted_roc_auc_ovo)
)
print(
"One-vs-Rest ROC AUC scores:\n{:.6f} (macro),\n{:.6f} "
"(weighted by prevalence)".format(macro_roc_auc_ovr, weighted_roc_auc_ovr)
)
One-vs-One ROC AUC scores:
0.999103 (macro),
0.999204 (weighted by prevalence)
One-vs-Rest ROC AUC scores:
0.999252 (macro),
0.999281 (weighted by prevalence)
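On the per-class ROC curves that raised the KeyError earlier: the error comes from indexing the pandas Series y_test with [:, i], whereas roc_curve needs one binary indicator column per class. Below is a minimal sketch of one way around it, using label_binarize on synthetic data with string labels like ours. It uses decision_function, which does not need probability=True; as far as I understand, that flag only adds an internally cross-validated calibration step for predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Synthetic 3-class data with string labels (assumption; not the real dataset).
X_demo, y_int = make_classification(n_samples=400, n_features=8, n_informative=5,
                                    n_classes=3, random_state=0)
names = np.array(["FullOfHolesCondition", "SmoothCondition", "UnevenCondition"])
y_demo = names[y_int]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)

svc = SVC(kernel="poly", C=10).fit(X_tr, y_tr)
y_score = svc.decision_function(X_te)               # shape (n_samples, n_classes)
y_bin = label_binarize(y_te, classes=svc.classes_)  # one indicator column per class

# ROC curve and area for each class (one-vs-rest).
fpr, tpr, roc_auc = {}, {}, {}
for i, label in enumerate(svc.classes_):
    fpr[label], tpr[label], _ = roc_curve(y_bin[:, i], y_score[:, i])
    roc_auc[label] = auc(fpr[label], tpr[label])

# Micro-average: pool every (sample, class) decision into one curve.
fpr["micro"], tpr["micro"], _ = roc_curve(y_bin.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
print(roc_auc)
```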
Let's evaluate using decision trees / random forests
Can't see the forest for the trees, so let's start with a single decision tree classifier with a max depth of 4
decision_tree = DecisionTreeClassifier(random_state=1, max_depth=4)
decision_tree.fit(X_train_prepared, y_train)
DecisionTreeClassifier(max_depth=4, random_state=1)
decision_tree.score(X_train_prepared, y_train)
0.8876971608832808
# testing
decision_tree_preds_test = decision_tree.predict(X_test_prepared)
print(decision_tree.score(X_test_prepared, y_test))
decision_tree_cm_test = confusion_matrix(y_test, decision_tree_preds_test)
decision_tree_cm_df_test = pd.DataFrame(decision_tree_cm_test, index=decision_tree.classes_, columns=decision_tree.classes_)
print(decision_tree_cm_df_test )
0.8868559411146162
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 518 93 39
SmoothCondition 78 2584 185
UnevenCondition 63 80 1115
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cmp = ConfusionMatrixDisplay(decision_tree_cm_test, display_labels=decision_tree.classes_)
cmp.plot(ax=ax)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x268fd822c10>
Let's try to visualise the tree structure
from sklearn import tree
# export the fitted decision tree in Graphviz dot format (returned as a string)
dotfile = tree.export_graphviz(decision_tree, out_file=None)
with open('decision_tree.dot','w') as f:
    f.write(dotfile)
Similar to the issues Donny came across, I needed to convert the dot file to a png file in a separate process outside of this notebook, using GraphViz's dot.exe directly in a command prompt:
dot decision_tree.dot -o decision_tree.png -Tpng
Full decision tree graph
Section of decision tree graph showing root and some internal nodes
Section of decision tree graph showing some internal and leaf nodes
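If installing or calling GraphViz is a problem, scikit-learn's export_text gives a plain-text view of a tree's rules with no external tools. A minimal sketch on synthetic data follows; the feature names are hypothetical placeholders, and for our model they would be the prepared column names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic 3-class problem (assumption; not the road-surface data).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

small_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

# Each line is a split ("feature <= threshold") or a leaf ("class: ...").
tree_text = export_text(small_tree, feature_names=feature_names)
print(tree_text)
```

This stays readable for shallow trees; for a very deep tree the GraphViz image is probably still the better option.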
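The loop below shows that as the max depth increases, the training and validation scores both rise. The training score reaches 1.0 at a depth of 19, while the validation score (computed here on the test set) levels off at about 0.99 from a depth of around 14.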
for i in range(1,25):
    model = DecisionTreeClassifier(max_depth=i)
    model.fit(X_train_prepared, y_train)
    print("Tree Depth ", i, "Training score: ", model.score(X_train_prepared, y_train), ", Validation score:", model.score(X_test_prepared, y_test))
Tree Depth 1 Training score: 0.6901156677181913 , Validation score: 0.6868559411146162
Tree Depth 2 Training score: 0.7399579390115668 , Validation score: 0.7345951629863302
Tree Depth 3 Training score: 0.772397476340694 , Validation score: 0.7652996845425868
Tree Depth 4 Training score: 0.8876971608832808 , Validation score: 0.8868559411146162
Tree Depth 5 Training score: 0.9091482649842272 , Validation score: 0.9003154574132493
Tree Depth 6 Training score: 0.9385383806519453 , Validation score: 0.9327024185068349
Tree Depth 7 Training score: 0.9472660357518402 , Validation score: 0.9398527865404837
Tree Depth 8 Training score: 0.9674027339642481 , Validation score: 0.961093585699264
Tree Depth 9 Training score: 0.9795478443743428 , Validation score: 0.9697160883280758
Tree Depth 10 Training score: 0.9875920084121977 , Validation score: 0.9802313354363827
Tree Depth 11 Training score: 0.9923238696109359 , Validation score: 0.9827549947423765
Tree Depth 12 Training score: 0.9958990536277602 , Validation score: 0.9871713985278654
Tree Depth 13 Training score: 0.9973186119873817 , Validation score: 0.9873817034700315
Tree Depth 14 Training score: 0.9989484752891693 , Validation score: 0.9890641430073607
Tree Depth 15 Training score: 0.9996845425867508 , Validation score: 0.9901156677181914
Tree Depth 16 Training score: 0.9998422712933754 , Validation score: 0.9888538380651946
Tree Depth 17 Training score: 0.9999474237644584 , Validation score: 0.9892744479495268
Tree Depth 18 Training score: 0.9999474237644584 , Validation score: 0.9890641430073607
Tree Depth 19 Training score: 1.0 , Validation score: 0.9899053627760253
Tree Depth 20 Training score: 1.0 , Validation score: 0.9886435331230284
Tree Depth 21 Training score: 1.0 , Validation score: 0.9896950578338591
Tree Depth 22 Training score: 1.0 , Validation score: 0.9894847528916929
Tree Depth 23 Training score: 1.0 , Validation score: 0.9903259726603575
Tree Depth 24 Training score: 1.0 , Validation score: 0.9884332281808622
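The sweep above scores each depth on the test set, which means the test set is also being used to choose the depth. One alternative, sketched here on synthetic data, is scikit-learn's validation_curve, which scores each depth with cross-validation on the training data only and leaves the test set for the final check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (assumption; not the road-surface data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=6,
                                     n_classes=3, random_state=0)
depths = np.arange(1, 11)

# Rows: one per depth; columns: one per cross-validation fold.
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=1), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)

for d, tr, cv in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"depth {d:2d}: train={tr:.3f}, cv={cv:.3f}")

best_depth = depths[cv_scores.mean(axis=1).argmax()]
print("depth with best mean CV score:", best_depth)
```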
Let's create the decision tree again, and this time not specify the max depth. We will then inspect the actual depth.
decision_tree = DecisionTreeClassifier(random_state=1)
decision_tree.fit(X_train_prepared, y_train)
DecisionTreeClassifier(random_state=1)
decision_tree.score(X_train_prepared, y_train)
1.0
# testing
decision_tree_preds_test = decision_tree.predict(X_test_prepared)
print(decision_tree.score(X_test_prepared, y_test))
decision_tree_cm_test = confusion_matrix(y_test, decision_tree_preds_test)
decision_tree_cm_df_test = pd.DataFrame(decision_tree_cm_test, index=decision_tree.classes_, columns=decision_tree.classes_)
print(decision_tree_cm_df_test )
0.9892744479495268
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 639 8 3
SmoothCondition 10 2823 14
UnevenCondition 3 13 1242
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cmp = ConfusionMatrixDisplay(decision_tree_cm_test, display_labels=decision_tree.classes_)
cmp.plot(ax=ax)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x268fd3c8cd0>
decision_tree.get_depth()
19
A random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. Let's switch over to a random forest classifier.
forest = RandomForestClassifier(random_state=1)
forest.fit(X_train_prepared, y_train)
RandomForestClassifier(random_state=1)
train_predicted = forest.predict(X_train_prepared)
forest.score(X_train_prepared, y_train)
1.0
# testing
forest_preds_test = forest.predict(X_test_prepared)
print(forest.score(X_test_prepared, y_test))
forest_cm_test = confusion_matrix(y_test, forest_preds_test)
forest_cm_df_test = pd.DataFrame(forest_cm_test, index=forest.classes_, columns=forest.classes_)
print(forest_cm_df_test )
0.9951629863301787
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 647 1 2
SmoothCondition 6 2837 4
UnevenCondition 1 9 1248
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cmp = ConfusionMatrixDisplay(forest_cm_test, display_labels=forest.classes_)
cmp.plot(ax=ax)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x268fdb38a00>
len(forest.estimators_)
100
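The voting described above can be checked directly on the fitted trees in estimators_. One detail worth knowing: scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem (assumption; not the road-surface data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
demo_forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# Average the per-tree class probabilities, then take the most probable class.
tree_probas = np.stack([t.predict_proba(X_demo) for t in demo_forest.estimators_])
mean_proba = tree_probas.mean(axis=0)
manual_pred = demo_forest.classes_[mean_proba.argmax(axis=1)]

same_proba = np.allclose(mean_proba, demo_forest.predict_proba(X_demo))
agreement = (manual_pred == demo_forest.predict(X_demo)).mean()
print(same_proba, agreement)
```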
The criterion parameter of the RandomForestClassifier determines the function used to measure the quality of a split. The default criterion is “gini” for the Gini impurity; another option is “entropy” for the information gain. Let's configure a random forest classifier using the entropy criterion.
forest_entropy = RandomForestClassifier(random_state=1, criterion='entropy')
forest_entropy.fit(X_train_prepared, y_train)
RandomForestClassifier(criterion='entropy', random_state=1)
train_predicted = forest_entropy.predict(X_train_prepared)
forest_entropy.score(X_train_prepared, y_train)
1.0
# testing
forest_entropy_preds_test = forest_entropy.predict(X_test_prepared)
print(forest_entropy.score(X_test_prepared, y_test))
forest_entropy_cm_test = confusion_matrix(y_test, forest_entropy_preds_test)
forest_entropy_cm_df_test = pd.DataFrame(forest_entropy_cm_test, index=forest_entropy.classes_, columns=forest_entropy.classes_)
print(forest_entropy_cm_df_test )
0.9955835962145111
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 647 1 2
SmoothCondition 5 2838 4
UnevenCondition 1 8 1249
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cmp = ConfusionMatrixDisplay(forest_entropy_cm_test, display_labels=forest_entropy.classes_)
cmp.plot(ax=ax)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x268fdcffd30>
len(forest_entropy.estimators_)
100
We observe that switching the random forest criterion from gini to entropy had little overall effect on model accuracy (0.9952 versus 0.9956 on the test set).
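For reference, here is a short worked sketch of the two impurity measures themselves: Gini impurity is 1 minus the sum of squared class proportions, and entropy is the Shannon entropy of the class proportions (in bits here). The example counts are the training-set class sizes from the support column of the training report earlier; `gini` and `entropy` are illustrative helpers, not scikit-learn functions.

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# FullOfHolesCondition, SmoothCondition, UnevenCondition counts in the training set.
train_counts = [2599, 11390, 5031]
print(f"gini={gini(train_counts):.4f}, entropy={entropy(train_counts):.4f} bits")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which may be part of why the two criteria gave such similar forests.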
Max depth and the number of estimators are the two hyperparameters typically adjusted in random forest classifiers, so let's see what effect increasing the estimator count has.
forest_entropy = RandomForestClassifier(random_state=1,n_estimators=200)
forest_entropy.fit(X_train_prepared, y_train)
RandomForestClassifier(n_estimators=200, random_state=1)
train_predicted = forest_entropy.predict(X_train_prepared)
forest_entropy.score(X_train_prepared, y_train)
1.0
#testing
forest_entropy_preds_test = forest_entropy.predict(X_test_prepared)
print(forest_entropy.score(X_test_prepared, y_test))
forest_entropy_cm_test = confusion_matrix(y_test, forest_entropy_preds_test)
forest_entropy_cm_df_test = pd.DataFrame(forest_entropy_cm_test, index=forest_entropy.classes_, columns=forest_entropy.classes_)
print(forest_entropy_cm_df_test )
0.9949526813880126
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 646 2 2
SmoothCondition 6 2837 4
UnevenCondition 1 9 1248
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cmp = ConfusionMatrixDisplay(forest_entropy_cm_test, display_labels=forest_entropy.classes_)
cmp.plot(ax=ax)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x268fde49a00>
len(forest_entropy.estimators_)
200
Doubling the number of estimators to 200 did not improve test accuracy (0.9950, compared with 0.9952 for 100 estimators) and increased training time. Let's use GridSearchCV to attempt to determine the optimal hyperparameters.
from sklearn.model_selection import GridSearchCV
# Create the parameter grid
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110, 120],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [50, 100, 200, 300, 1000]
}
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
                           cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(X_train_prepared, y_train)
Fitting 3 folds for each of 225 candidates, totalling 675 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 3.2min
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 16.0min
[Parallel(n_jobs=-1)]: Done 357 tasks | elapsed: 35.7min
[Parallel(n_jobs=-1)]: Done 640 tasks | elapsed: 63.5min
[Parallel(n_jobs=-1)]: Done 675 out of 675 | elapsed: 67.6min finished
GridSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,
param_grid={'bootstrap': [True],
'max_depth': [80, 90, 100, 110, 120],
'min_samples_leaf': [3, 4, 5],
'min_samples_split': [8, 10, 12],
'n_estimators': [50, 100, 200, 300, 1000]},
verbose=2)
grid_search.best_params_
{'bootstrap': True,
'max_depth': 80,
'min_samples_leaf': 3,
'min_samples_split': 8,
'n_estimators': 1000}
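The exhaustive search above needed 675 fits and about 68 minutes. A cheaper option to consider is RandomizedSearchCV, which samples a fixed number of candidates from the grid. The sketch below runs on synthetic data with illustrative parameter values (not tuned recommendations). It also includes smaller max_depth values, since the single unpruned tree earlier stopped at depth 19, so limits of 80 to 120 probably never come into play.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class problem (assumption; not the road-surface data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# Illustrative search space: 4 * 3 * 3 * 3 = 108 possible combinations.
param_distributions = {
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 8, 12],
    "n_estimators": [25, 50, 100],
}

# Only n_iter combinations are sampled and evaluated.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1), param_distributions,
    n_iter=8, cv=3, random_state=1,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```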
best_model = grid_search.best_estimator_
best_pred = best_model.predict(X_train_prepared)
best_model.score(X_train_prepared, y_train)
0.9992639327024185
#testing
best_preds_test = best_model.predict(X_test_prepared)
print(best_model.score(X_test_prepared, y_test))
best_model_cm_test = confusion_matrix(y_test, best_preds_test)
best_model_cm_df_test = pd.DataFrame(best_model_cm_test, index=best_model.classes_, columns=best_model.classes_)
print(best_model_cm_df_test)
0.9928496319663512
FullOfHolesCondition SmoothCondition UnevenCondition
FullOfHolesCondition 643 5 2
SmoothCondition 8 2833 6
UnevenCondition 1 12 1245
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cmp = ConfusionMatrixDisplay(best_model_cm_test, display_labels=best_model.classes_)
cmp.plot(ax=ax)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x268fd6561f0>
print(classification_report(y_test, best_preds_test))
precision recall f1-score support
FullOfHolesCondition 0.99 0.99 0.99 650
SmoothCondition 0.99 1.00 0.99 2847
UnevenCondition 0.99 0.99 0.99 1258
accuracy 0.99 4755
macro avg 0.99 0.99 0.99 4755
weighted avg 0.99 0.99 0.99 4755